Published by Cheyanne Hancey. Modified over 10 years ago.
1
Effective and Inexpensive (Memory) Race Recording
Min Xu
Thesis Defense, 05/04/2006
Electrical and Computer Engineering Department, UW-Madison
Advisors: Mark Hill, Rastislav Bodik
Committee: Remzi Arpaci-Dusseau, Mikko Lipasti, Barton Miller, David Wood
2
Overview
Increasingly useful to replay multithreaded code
Race recording: key to dealing with nondeterminism
A case study
  Long recording: 1 byte/kilo-instr
  Always-on recording: less than 2% overhead
  Low cost: 24 KB RAM/core
  Supports both SC & TSO (x86-like)
[Diagram labels: Effective, Inexpensive, Race Recorder; Long Recording, More Applicable, Low Overhead, Low Cost]
3
Thesis Contributions
Effective
  RTR Algorithm: Small Log Size
  Order-Value Hybrid: SC & TSO Applicability
Inexpensive
  Coherence Piggyback: Low Runtime Overhead
  Set/LRU Approximation: Low Cost Hardware
4
Outline
Motivation & Problem (5 slides)
An Effective and Inexpensive Race Recorder (21 slides)
  RTR Algorithm
  Coherence Piggyback
  Set/LRU Approximation
  Order-Value Hybrid
Evaluation Method & Results (6 slides)
Conclusion & My Other Research (3 slides)
5
Motivation & Problem
6
Multithreaded Debugging
Single-threaded program: the crash reproduces under gdb
  % gcc hash.c
  % a.out
  Segmentation fault
  % gdb a.out
  gdb> run
  Program received SIGSEGV.
  In get() at hash.c:45
  45 a = bucket->d;
Multithreaded program: the crash may not reproduce under gdb
  % gcc para-hash.c
  % a.out
  Segmentation fault
  % gdb a.out
  gdb> run
  Program exited normally.
  gdb>
Multithreaded program with race recording: the crash is replayed from the log
  % gcc para-hash.c
  % a.out
  Segmentation fault
  Race recorded in “log”
  % gdb a.out log
  gdb> run
  Program received SIGSEGV.
  In get() at para-hash.c:67
  67 a = bucket->d;
7
Race Recording
Thread I: X = 1; X++; print(X)
Thread J: X = X*5
Original execution: X=6
Replay without a log: X=10
Recording, then replay with the log: X=6
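A minimal Python sketch of this example, assuming a hypothetical `run` helper that executes the two threads under an explicit schedule. The schedule list stands in for the race log; neither name comes from the slides.

```python
def run(schedule):
    """Execute threads I and J in the given interleaving; return what I prints."""
    programs = {
        "I": iter(["X = 1", "X++", "print(X)"]),  # program order of thread I
        "J": iter(["X = X*5"]),                   # program order of thread J
    }
    x, printed = 0, None
    for thread in schedule:
        op = next(programs[thread])               # next instruction of that thread
        if op == "X = 1":
            x = 1
        elif op == "X++":
            x += 1
        elif op == "X = X*5":
            x *= 5
        else:                                     # print(X)
            printed = x
    return printed

original = ["I", "J", "I", "I"]   # J's multiply lands between X = 1 and X++
unlogged = ["I", "I", "J", "I"]   # a different, equally legal interleaving

print(run(original))   # 6
print(run(unlogged))   # 10
print(run(original))   # replaying the recorded order gives 6 again
```

The point of the sketch is only that the printed value is a function of the interleaving, so recording the interleaving is enough to reproduce it.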
8
Recording for Multithreaded Replay
Race Recording (focus)
  Not an issue for a single thread
  Create the same general & data races
Checkpointing
  Provide a snapshot of the program state
  Many proposals (e.g., SafetyNet), not the focus
Input Recording
  Provide repeatable inputs
  Some proposals (e.g., part of FDR), not the focus
9
A Good Race Recorder
  % gcc para-hash.c
  % a.out
  Segmentation fault
  Race recorded in “log”
  % gdb a.out log
  gdb> run
  Program received SIGSEGV.
  In get() at para-hash.c:67
  67 a = bucket->d;
Long recording: small log
Low runtime overhead
Low cost
Applicability
10
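Desired & Existing Race Recorders
Criteria, and what the desired recorder offers:
  Recording length: small log size
  Applicability: MP, racey code, SC, TSO
  Overhead: negligible slowdown
  Cost: little hardware
Recorders compared: InstRply ’87, R&C ’90, Bacon ’91, Netzer ’93, Déjà Vu ’98, RecPlay ’00, JaRec ’04, Our Recorder
[Table: per-recorder ratings not preserved in this transcript]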
11
RTR Algorithm: Small Log Size
[Roadmap: RTR Algorithm, Coherence Piggyback, Set/LRU Approximation, Order-Value Hybrid]
12
Problem Formulation
Reproduce the exact same conflicts: no more, no less
[Diagram: recording and replay of threads I and J over loads and stores to A, B, C, and D, with a log between them; conflicts in red, dependences in black]
13
Log All Conflicts
Detect conflicts
Write log
Assign IC (logical timestamps)
[Diagram: threads I and J, instructions numbered 1-6 in each thread]
Dependence log (16 bytes per dependence):
  Log J: 2→3, 1→4, 3→5, 4→6
  Log I: 2→3
Log size: 5*16 = 80 bytes (10 integers)
But too many conflicts
14
Netzer’s Transitive Reduction
[Diagram: threads I and J, instructions numbered 1-6 in each thread]
TR-reduced log:
  Log J: 2→3, 3→5, 4→6
  Log I: 2→3
Log size: 64 bytes (8 integers)
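The reduction from 80 to 64 bytes can be sketched in Python as below. This is a simplified illustration, not the thesis implementation: it only drops a dependence when a single other dependence in the same log already enforces it, and it ignores chains through the other thread. The function names and the `(source IC, destination IC)` tuple layout are assumptions.

```python
ENTRY_BYTES = 16  # two 8-byte integers per logged dependence, as on the slide

def implied_by(dep, other):
    """True if `other` = (src2, dst2) already enforces `dep` = (src, dst).

    src2 >= src and dst2 <= dst gives the chain
    src -> src2 (program order), src2 -> dst2 (logged), dst2 -> dst (program order).
    """
    return other != dep and other[0] >= dep[0] and other[1] <= dep[1]

def reduce_pairwise(deps):
    """Drop every dependence implied by another one between the same thread pair."""
    return [d for d in deps if not any(implied_by(d, o) for o in deps)]

log_j = [(2, 3), (1, 4), (3, 5), (4, 6)]   # (IC in thread I, IC in thread J)
log_i = [(2, 3)]                           # (IC in thread J, IC in thread I)

reduced_j = reduce_pairwise(log_j)
reduced_i = reduce_pairwise(log_i)

print(reduced_j)                                          # [(2, 3), (3, 5), (4, 6)]
print((len(log_j) + len(log_i)) * ENTRY_BYTES)            # 80
print((len(reduced_j) + len(reduced_i)) * ENTRY_BYTES)    # 64
```

Here 1→4 disappears because 2→3 is stricter on both ends; the other three entries in Log J survive.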
15
The Intuition of the New RTR Algorithm
[Diagram: dependences after reduction, from I to J and from J to I, grouped into vectors]
Regulate replay (RTR)
16
Stricter Dependences to Aid Vectorization
[Diagram: threads I and J, instructions numbered 1-6 in each thread; some reduced dependences are replaced by stricter ones]
New reduced log:
  Log J: 2→3, 4→5
  Log I: 2→3
Log size: 48 bytes (6 integers)
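A small Python check of why the stricter log is still safe for this example: every dependence in the TR-reduced log is enforced by some entry of the new log. The helper names are hypothetical, and the sketch does not model the deadlock check that a real recorder also needs.

```python
def enforces(strict, dep):
    """A dependence (s, d) enforces (s0, d0) when s >= s0 and d <= d0."""
    return strict[0] >= dep[0] and strict[1] <= dep[1]

def covers(strict_log, reduced_log):
    """True if every dependence in reduced_log is enforced by some strict entry."""
    return all(any(enforces(s, d) for s in strict_log) for d in reduced_log)

tr_log_j = [(2, 3), (3, 5), (4, 6)]   # Log J after transitive reduction
rtr_log_j = [(2, 3), (4, 5)]          # Log J with the stricter dependence 4->5

print(covers(rtr_log_j, tr_log_j))    # True: 4->5 enforces both 3->5 and 4->6
print((len(rtr_log_j) + 1) * 16)      # 48 bytes, counting the one entry in Log I
```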
17
Compress Vectorized Dependencies
[Diagram: threads I and J, instructions numbered 1-6 in each thread; dependences grouped into vectors]
Vectorized log:
  Log J: x=3,5, ∆=1
  Log I: x=3, ∆=1
Log size: 40 bytes (5 integers)
Reduces log size to KB/core/second
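The 5-integer encoding can be reproduced with a short Python sketch. The grouping rule (one ∆ plus the list of destination ICs x, with the source IC being x − ∆) is read off the slide's example; the actual log format is not given in the slides, so treat this layout as an assumption.

```python
from collections import defaultdict

INT_BYTES = 8  # assumed integer width, consistent with 16 bytes per (src, dst) pair

def vectorize(deps):
    """Group (src, dst) dependences by delta = dst - src, keeping only dst per group."""
    groups = defaultdict(list)
    for src, dst in deps:
        groups[dst - src].append(dst)
    return dict(groups)

def encoded_ints(vectors):
    """One integer per x value plus one integer for each group's delta."""
    return sum(len(xs) + 1 for xs in vectors.values())

log_j = vectorize([(2, 3), (4, 5)])
log_i = vectorize([(2, 3)])

print(log_j)   # {1: [3, 5]}  i.e. x=3,5 with delta=1
print(log_i)   # {1: [3]}     i.e. x=3 with delta=1
print((encoded_ints(log_j) + encoded_ints(log_i)) * INT_BYTES)   # 40
```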
18
Coherence Piggyback: Low Runtime Overhead
[Roadmap: RTR Algorithm, Coherence Piggyback, Set/LRU Approximation, Order-Value Hybrid]
19
Detect Conflicts
[Diagram: recording of thread I (1: ld A, 2: st B, 3: st C) and thread J (1: add, 2: st C, 3: ld B, 4: st A), with per-address metadata such as A.readers and A.writer]
Bookkeeping per memory instruction:
  I:1 ld A:  A.readers.add(I, 1)
  I:2 st B:  B.writer = (I, 2)
  J:2 st C:  C.writer = (J, 2)
  I:3 st C:  if (C.writer != I) log(WAW)
             foreach reader in C.readers: if (reader != I) log(WAR)
             C.readers.clear()
             C.writer = (I, 3)
  J:3 ld B:  if (B.writer != J) log(RAW)
             B.readers.add(J, 3)
  …
Expensive in software
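The bookkeeping on this slide can be written out as a runnable Python sketch. The class name, the log tuple layout, and the particular interleaving of the two threads are assumptions made for illustration; only the readers/writer rules come from the slide.

```python
from collections import defaultdict

class ConflictDetector:
    """Software conflict detection with a last-writer and a reader set per address."""

    def __init__(self):
        self.writer = {}                   # addr -> (thread, ic) of the last store
        self.readers = defaultdict(dict)   # addr -> {thread: ic} of loads since that store
        self.log = []                      # (kind, src_thread, src_ic, dst_thread, dst_ic)

    def load(self, thread, ic, addr):
        w = self.writer.get(addr)
        if w is not None and w[0] != thread:
            self.log.append(("RAW", w[0], w[1], thread, ic))
        self.readers[addr][thread] = ic

    def store(self, thread, ic, addr):
        w = self.writer.get(addr)
        if w is not None and w[0] != thread:
            self.log.append(("WAW", w[0], w[1], thread, ic))
        for reader, reader_ic in self.readers[addr].items():
            if reader != thread:
                self.log.append(("WAR", reader, reader_ic, thread, ic))
        self.readers[addr].clear()
        self.writer[addr] = (thread, ic)

# One possible interleaving of the slide's example (J:1 is an add, no memory access).
det = ConflictDetector()
det.load("I", 1, "A")
det.store("I", 2, "B")
det.store("J", 2, "C")
det.store("I", 3, "C")   # WAW with J:2
det.load("J", 3, "B")    # RAW with I:2
det.store("J", 4, "A")   # WAR with I:1
for entry in det.log:
    print(entry)
```

Every load and store pays for a table lookup and update, which is the cost the slide calls expensive in software.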
20
Use Cache and Cache Coherence
Proc I cache:
  Tag  State  Data  Timestamp
  A    S      …     1
  B    M      …     4
Proc J cache:
  Tag  State  Data  Timestamp
  A    S      …     3
  B    I      …     2
The per-line timestamps play the role of A.readers, A.writer, B.readers, B.writer
ld B: Get/S request, then data response carrying the timestamp
RAW detected & logged
Detect conflicts in hardware with little runtime cost
21
Cache Evictions and Writebacks
[Diagram: Proc I cache (A: S, timestamp 1; B: M, timestamp 4) and Proc J cache (A: S, timestamp 3; B: I, timestamp 2); a line C (M, timestamp 3); directory of A: Shared(I,J), Owner(). st A triggers Get/S, Inv, and Ack messages; the timestamp to return after an eviction is marked "Timestamp?"]
WAR detected & logged
OK with nonsilent eviction & directory eviction
22
Implement TR and RTR in Hardware
Ideal TR requires vector timestamps
  Too expensive
  New idea: Pairwise-TR (use scalar timestamps)
  Enables pairwise transitive reduction
Optimal RTR algorithm is likely expensive
  Implement a greedy RTR algorithm
  One-pass, online algorithm
  Keep a sliding window of vectorizable dependencies
23
Hardware Implementation
Cache
  Eviction/writeback: solved, more details later
  Directory protocols: solved
  Snooping protocols: partly solved
  Two-level coherence: not yet solved
Processor
  Out-of-order/prefetching: solved
  Unordered messages: solved
  Counter overflow: solved
  Thread migration: not yet solved
24
Set/LRU Approximation: Low Cost Hardware
[Roadmap: RTR Algorithm, Coherence Piggyback, Set/LRU Approximation, Order-Value Hybrid]
25
Timestamp Approximation
[Diagram: one set of I’s cache ($) with lines A (S, timestamp 1), B (M, timestamp 2), and C (M, timestamp 3); directory of A: Shared(I); threads I and J during recording]
Use the current IC of thread I
Correct, but more evictions → more logged conflicts
26
[Figure: hardware cost vs. log size]
27
Set/LRU Approximation
[Diagram: one set of I’s cache ($) with lines A (S, timestamp 1), B (M, timestamp 2), and C (M, timestamp 3); threads I and J during recording]
Instead of using the current IC of thread I:
  LRU guarantees B’s TS > A’s TS
  Set/LRU better preserves reducibility
  Small $ → more misses, but still a small log
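One way to picture a Set/LRU timestamp memory is the Python sketch below. It is a guess at the mechanism, not the design from the thesis: it assumes that each set remembers the largest timestamp it has evicted and returns that bound on a miss, which is safe because LRU evicts the oldest entry first. The class name and the modulo set-indexing are also assumptions.

```python
from collections import OrderedDict

class SetLruTimestampMemory:
    """Set-associative timestamp store with LRU replacement (illustrative only)."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.evicted_bound = [0] * num_sets   # assumed: newest timestamp evicted per set

    def touch(self, block, ic):
        """Record that `block` was accessed at instruction count `ic`."""
        index = block % self.num_sets
        entries = self.sets[index]
        entries[block] = ic
        entries.move_to_end(block)            # most recently used goes last
        if len(entries) > self.ways:
            _, old_ts = entries.popitem(last=False)   # evict the LRU entry
            self.evicted_bound[index] = max(self.evicted_bound[index], old_ts)

    def lookup(self, block):
        """Exact timestamp on a hit; on a miss, a bound no older than any evicted entry."""
        index = block % self.num_sets
        return self.sets[index].get(block, self.evicted_bound[index])

tsm = SetLruTimestampMemory(num_sets=1, ways=2)
tsm.touch(0, 1)    # A at IC 1
tsm.touch(1, 2)    # B at IC 2
tsm.touch(2, 3)    # C at IC 3 evicts A
print(tsm.lookup(0))   # 1: bound for the evicted A, tighter than the current IC of 3
print(tsm.lookup(1))   # 2: exact
```

Under this assumption the approximate timestamp for an evicted line stays old, so fewer dependences look new and the transitive reduction keeps working.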
28
Hardware Cost of Timestamps
Coupled timestamp memory: overhead grows with cache size
  Not flexible
  64B line + 64b (24b) timestamp → 12.5% (4.7%) overhead
  192 KB for a 4MB L2
  Need to modify cache
[Diagram: coupled timestamp memory, each cache line holding Tag, State, Data, Timestamp]
29
Decoupled Timestamp Memory
Decoupling
  Small timestamp memory (Set/LRU)
  e.g., 32-set, 64-way
  99% transitive reduction
  Timestamp memory: 24 KB
  No need to modify cache
[Diagram: coupled timestamp memory (each line holds Tag, State, Data, Timestamp) vs. a cache (Tag, State, Data) plus a separate timestamp memory (Tag, Timestamp)]
From 192 KB to 24 KB: 8x reduction
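The storage figures for the coupled and decoupled designs follow from simple arithmetic, reproduced here in Python. The bytes-per-entry value for the decoupled memory is derived from the slide's numbers; how those bytes split between tag and timestamp is not stated, so no split is assumed.

```python
KB = 1024
LINE_BYTES = 64                  # cache line size
L2_BYTES = 4 * 1024 * KB         # 4 MB L2

def coupled_overhead(ts_bits):
    """Timestamp bits as a fraction of the data bits in one cache line."""
    return ts_bits / (LINE_BYTES * 8)

def coupled_size_bytes(ts_bits):
    """Total timestamp storage when every L2 line carries its own timestamp."""
    return (L2_BYTES // LINE_BYTES) * ts_bits // 8

decoupled_entries = 32 * 64      # 32 sets x 64 ways
decoupled_bytes = 24 * KB

print(coupled_overhead(64))                        # 0.125     (12.5%)
print(coupled_overhead(24))                        # 0.046875  (about 4.7%)
print(coupled_size_bytes(24) // KB)                # 192 KB
print(decoupled_entries)                           # 2048 entries
print(decoupled_bytes // decoupled_entries)        # 12 bytes per entry
print(coupled_size_bytes(24) // decoupled_bytes)   # 8x reduction
```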
30
Order-Value Hybrid: SC & TSO Applicability
[Roadmap: RTR Algorithm, Coherence Piggyback, Set/LRU Approximation, Order-Value Hybrid]
31
Recording with Total Store Order (TSO)
Majority of existing MPs are non-SC
TSO is well defined, x86-like
Example (initially A=B=0):
  Thread I: 1: st A,1   2: ld B
  Thread J: 1: st B,1   2: ld A
Outcomes under SC: A=1 B=0; A=0 B=1; A=1 B=1
Additional outcome under TSO: A=0 B=0
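The four outcomes can be enumerated with a small Python model. It is a sketch under simplifying assumptions: each thread has one store and one load, a store is split into a "st" event (enter the write buffer) and a "drain" event (become globally visible), and SC is modeled by forcing the drain to follow its store immediately. All names are hypothetical.

```python
from itertools import permutations

# Thread I: st A,1 ; ld B      Thread J: st B,1 ; ld A      initially A = B = 0
STORE_ADDR = {"I": "A", "J": "B"}
LOAD_ADDR = {"I": "B", "J": "A"}
EVENTS = [(t, k) for t in "IJ" for k in ("st", "drain", "ld")]

def outcomes(store_buffer):
    """Return the set of (value loaded from A, value loaded from B) pairs."""
    results = set()
    for order in permutations(EVENTS):
        pos = {event: i for i, event in enumerate(order)}
        legal = all(
            pos[(t, "st")] < pos[(t, "ld")]                  # program order
            and (pos[(t, "drain")] > pos[(t, "st")]          # TSO: drain any time later
                 if store_buffer
                 else pos[(t, "drain")] == pos[(t, "st")] + 1)  # SC: visible at once
            for t in "IJ"
        )
        if not legal:
            continue
        memory = {"A": 0, "B": 0}
        seen = {}
        for thread, kind in order:
            if kind == "drain":
                memory[STORE_ADDR[thread]] = 1
            elif kind == "ld":
                seen[LOAD_ADDR[thread]] = memory[LOAD_ADDR[thread]]
        results.add((seen["A"], seen["B"]))
    return results

print(sorted(outcomes(store_buffer=False)))   # [(0, 1), (1, 0), (1, 1)]
print(sorted(outcomes(store_buffer=True)))    # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The extra (0, 0) result appears only when both loads run while both stores still sit in their write buffers.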
32
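TSO Execution
Example (initially A=B=0):
  Thread I: 1: st A,1   2: ld B
  Thread J: 1: st B,1   2: ld A
[Diagram: st A,1 and st B,1 sit in the write buffers (WrBuf) of I and J while the memory system still holds A=0, B=0, so the loads return A=0 and B=0; once the buffers drain, memory holds A=1, B=1]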
33
Order-Value-Hybrid Recording
Example (initially A=B=0):
  Thread I: 1: st A,1   2: ld B
  Thread J: 1: st B,1   2: ld A
[Diagram, recording: the stores wait in the write buffers of I and J while the loads read A=0 and B=0 from the memory system; the loaded addresses are monitored (Start Monitor A, Start Monitor B); when A changes while monitored (A Changed!), the WAR is omitted and the value A=0 is logged; Stop Monitor B]
[Diagram, replay: the logged value is used for the load]
34
Hybrid Recording with TR and RTR
Hybrid recording
  All loads get correct values
  Hardware similar to OoO SC [Gharachorloo et al. ’91]
Hybrid + TR & RTR
  TR will not use the omitted WAR in reduction
  RTR vectorizes dependencies more conservatively
35
Evaluation Method & Results
36
Put-it-together: Determinizer/CMP
[Diagram: CMP with Core 1-4 and a shared L2 cache (L1 directory) with TSM; each core has L1 I$ and D$, TSM, IC, L1 coherence controller, log, TR register, and RTR register]
37
Simulation Method
Commercial server hardware
  GEMS: http://www.cs.wisc.edu/gems
  Full-system (OS + application) executions
  4-core CMP (sequentially consistent)
  1-way in-order issue, 2 GHz, 64KB I/D L1, 4MB L2, 64-byte lines, MOSI directory
Commercial server software
  Apache: static web serving
  SpecJBB: middleware
  OLTP: TPC-C like
  Zeus: static web serving
38
Log Size: 1 byte/kilo-instr
Well within the capability of current machines
Long recording (days to months) needs improvement
[Charts: log size in byte/core/kilo-instr (axis 0.0-2.0) and in KB/core/s (axis 0-200) for Apache, JBB, OLTP, Zeus, AVG]
39
Runtime Overhead
[Charts: execution time and interconnection message bandwidth, baseline vs. with race recorder (axis 0-100), for Apache, JBB, OLTP, Zeus]
Our recorder can be “always-on”
40
Benefits of RTR and Set/LRU (Log Size)
[Chart: improvement by RTR, log size with Pairwise-TR vs. our RTR (axis 0-100), for Apache, JBB, OLTP, Zeus, AVG]
[Chart: effectiveness of Set/LRU, log size with a perfect TSM vs. a 24KB Set/LRU TSM (axis 0-100), for Apache, JBB, OLTP, Zeus, AVG]
41
Why Do RTR and Set/LRU Work Well?
RTR
  Processors execute instructions at similar speeds
  Therefore, we can find “vectorizable” dependencies
Set/LRU
  Temporal locality makes the LRU timestamps old
  We only need to know whether a timestamp is “old enough”
42
Sensitivity and Scalability
A design space of the timestamp memory (TSM)
  Size: smaller TSM → larger log
  Read/write timestamp: should be used when the TSM is large
  Partial timestamp: 24 bits is enough
  Associativity: higher is better for RTR
Scalability of the recorder
  Studied with modest processor counts (2p-16p)
  Commercial workloads, not scientific workloads
  Log size increases slowly with the number of cores
43
Conclusion & My Other Research
44
Race Recording
Race recording
  Key to combating nondeterminism
My thesis: an effective & inexpensive recorder
  RTR algorithm → small log size
  Coherence piggyback → negligible slowdown
  Timestamp approximation → low hardware cost
  Order-value hybrid → supports SC & TSO
Future work
  Improve the race recording algorithm
  Improve the race recorder implementation
  Study race replay
45
Serializability Violation Detector [PLDI’05]
Like a race detector
No a priori annotation requirement
  “Critical sections” are inferred
Intended to detect bugs that “actually” happen
  Checks for a 2-Phase-Locking condition
[Diagram: a “critical section” that reads in1 and in2 and writes out1 and out2 (shared variables), with local writes and reads in between]
46
Publications
FDR (ISCA’03)
  Adopted by UCSD BugNet (ISCA’05)
SVD (PLDI’05)
  Cited by Vaziri et al. (POPL’06)
  Influenced new data race definition
RTR, Set/LRU & Hybrid
  Submitted for publication
47
Thank you!
  % gcc para-hash.c
  % a.out
  Segmentation fault
  Race recorded in “log”
  % gdb a.out log
  gdb> run
  Program received SIGSEGV.
  In get() at para-hash.c:67
  67 a = bucket->d;
48
Acknowledgements
Joint work with my advisors: Mark Hill, Ras Bodik
Ph.D. committee: David Wood, Mikko Lipasti, Remzi Arpaci-Dusseau, Barton Miller
Multifacet Group: Milo Martin, Dan Sorin, Carl Mauer, Brad Beckmann, Kevin Moore, Alaa Alameldeen, Mike Marty, Luke Yen
Affiliates & companies: Joe Emer, CJ Newburn, Peter Hsu, Bob Zak, Eric Bach, Gang Luo, Alex Chow, IBM, Intel, Microsoft, Sun
49
Deterministic Replay is Useful
Deterministic replay is logically recreating a program execution
Present applications
  Cyclic debugging [Pancake & Netzer ’93]
  Fault tolerance (ExtraVirt [Lucchetti et al. ’05])
  Intrusion analysis (ReVirt [Dunlap et al. ’02])
Future applications
  Data recovery
  Replay-based synchronization
50
Multicore and Multithreading
Multicore is common
  AMD X2
  IBM Power 5/6, Cell
  Intel Pentium D, Core Duo
  Sun SPARC T1
Multithreading is common
  Server: high throughput
  Scientific: high performance
  Desktop/embedded: low response time
51
Race Recording: Key to Determinism
Races: general race & data race [Netzer & Miller]
  Both cause nondeterminism
Race recording can help, but existing race recorders are inadequate
  Some generate large logs
  Some have high runtime overhead
  Some have high hardware cost (space overhead)
  They support only sequential consistency
Need a better race recorder
52
Recording/Replay & Debugging
[Diagram: an online recorder on processors P1-P4 takes Checkpoint A, B, and C and stores log A, B, and C; after a crash and crash dump (“core”), the deterministic replayer reads Checkpoint B and replays from logs B and C]
53
Deterministic Replay & Fault Tolerance
Fault recovery: replay after a failure
Fault detection: replay, then compare
(Courtesy of VMware)
54
Future: Record/Replay & Undo/Redo
VM as a software platform
  Eases software development
  Fine granularity in undo and redo
[Screenshot: Windows XP]
55
Future: Replay-based Synchronization
Three steps: coarse-grain sync. → fine-grain sync. → hardware sync.
Result: higher performance
Works only with static control flow & fixed data addresses
  DSP kernels
[Diagram: recording of ld A, st B, Unlock() and lock(), st A, ld B; the replay log orders ld A, st B before st A, ld B without the lock operations]
56
Race Recording Related Work
Total-order recorders (low replay parallelism) vs. partial-order recorders (high replay parallelism)
Recorders: Bacon ’91 (Hardware), RecPlay ’00, JaRec ’04, R&C ’90, Déjà Vu ’98, Instant Replay ’87, Netzer ’93
Mechanisms: bus transactions, Lamport clocks, scheduling, bus transaction groups, variable version, vector clocks
[Table: per-recorder log size (large or small) and overhead (low, low for sync only, low for non-MP, high) not recoverable from this transcript]
57
Correctness of Order-Value-Hybrid
Removing WAR dependencies
  Say thread I reads and thread J writes
  Removing the WAR affects I’s read, not J’s write
  But for every dependence removed, thread I reads the correct value from the value log
  Therefore, all reads get the correct value
58
TR and TSO
TR affects dependencies reduced by a WAR
  The WAR itself may later be removed during replay
Solution
  Do not use a WAR in TR if the WAR can be removed
  Respond with a special flag when a loaded cache line is stolen
[Diagram: recording of thread I (st A, st C, ld B) and thread J (st B, st C, ld A), instructions numbered 1-3; one dependence is marked “must not be reduced”]
59
RTR and TSO
The sliding window may expose the ordered loads
  Shrink the sliding window to avoid it
[Diagram: recording of threads I and J, instructions numbered 1-4 (st A, add, sub, st B, ld A, ld C, ld B); stores ordered in the write buffer; old and new windows for j:3; one ordering is not allowed by the new window]
60
Deadlock Avoidance of RTR
[Diagram: recording of threads I and J, instructions numbered 1-6; the replay constraints i:4, j:1, j:2, i:3, i:4 form a replay cycle]
Avoid deadlock by adhering to an SC total order
61
Recording Race-free Executions
No data races
  Only need to record synchronization races
  Deterministic replay up until the first data race
62
Replay Parallelism
Replay performance depends on
  (1) Number of synchronizations
  (2) Extra wait incurred by the synchronizations
63
Directory Protocols
Add sticky states in the directory
  Retain states after writebacks
  Need extra acknowledgements
Or, add extra timestamp memory in the directory
  Helps to avoid extra acknowledgements
A tradeoff
  Sticky states can be cheaper
  But extra timestamp memory can be faster
64
Snooping Protocols
Key problem is the combined/implicit response
  Not a problem for AMD Hammer
[Diagram: Proc I cache (A: S, timestamp 1; B: M, timestamp 4) and Proc J cache (A: S, timestamp 3; B: I, timestamp 2); st A issues Get/X, the shared line is pulled, and the response carries the current IC]
WAR detected & logged
65
Nonsilent Evictions
[Diagram: Proc I cache (A: S, timestamp 1; B: M, timestamp 4) and Proc J cache (A: S, timestamp 3; B: I, timestamp 2); a line C (M, timestamp 3); directory of A: Shared(J), Owner(), StickyS(I,J); st A triggers Get/S and Ack; timestamp memory eviction]
Directory eviction: more false conflicts, like snooping
66
Out-of-Order & Hardware Prefetching
Speculative execution
  No IC assigned yet
Hardware prefetching
  No IC assigned
Key idea: receive observation
  Can associate a ld/st with the current commit instruction
67
Unordered Messages in Interconnect
Messages arrive out of order
  Can affect reduction
  But better to add a sequence number
  Reconstruct the message order
  Enable IC compression by sending deltas
68
Integer Overflow
IC and timestamps may overflow
  IC: make it 64-bit, will not overflow for a long time
  Timestamps: use approximation techniques
  MSB of IC + LSB of timestamps
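The "MSB of IC + LSB of timestamps" idea can be sketched in Python as below. The function name and the wrap-around rule are assumptions: the sketch presumes the true timestamp is no newer than the current IC and lies within 2^bits of it, which the slides do not spell out.

```python
def reconstruct(partial, current_ic, bits=24):
    """Rebuild a full timestamp from its low `bits` bits and the current IC.

    Assumes current_ic - 2**bits < true timestamp <= current_ic.
    """
    span = 1 << bits
    full = (current_ic & ~(span - 1)) | partial   # MSBs of IC + LSBs of timestamp
    if full > current_ic:                         # the low bits wrapped since then
        full -= span
    return full

true_ts = 0x1FFFFF0
current_ic = 0x2000010
partial = true_ts & 0xFFFFFF                      # only 24 bits are stored

print(hex(reconstruct(partial, current_ic)))      # 0x1fffff0
```

Storing 24 bits instead of 64 is what brings the per-line overhead on the coupled design from 12.5% down to about 4.7%.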
69
68 Varying TSM Size
70
69 Varying Associativity
71
70 Varying Partial Timestamp Width
72
Log Size Scaling
[Chart: log size (MB/core/s, axis 0.0-1.0) vs. number of cores (2, 4, 8, 16) for Apache, SPECjbb, OLTP, Zeus]
73
In Retrospect …
What are you most proud of?
  RTR improves TR after 13 years
What would you do differently if doing it again?
  “Replaying me is deterministic” (just kidding)
  I wish I had focused on race recording earlier
What should the industry do?
  Implement the recorder as a VMM extension