Download presentation
Presentation is loading. Please wait.
Published byErnest Bramley Modified over 10 years ago
1
RTR: 1 Byte/Kilo-Instruction Race Recording Min Xu Rastislav BodikMark D. Hill
2
1 % gcc sim.c % a.out Segmentation fault % % gdb a.out gdb> run Program received SIGSEGV. In get() at hash.c:45 45 a = bucket->d; % gdb a.out gdb> run Program exited normally. gdb> % gcc para-sim.c % a.out Segmentation fault % Why Do You Need a Recorder? % gdb a.out log gdb> run Program received SIGSEGV. In get() at para-hash.c:67 67 a = bucket->d; % gcc para-sim.c % a.out Segmentation fault Race recorded in “log” %
3
2 Ideally … % gdb a.out log gdb> run Program received SIGSEGV. In get() at para-hash.c:67 67 a = bucket->d; % gcc para-sim.c % a.out Segmentation fault Race recorded in “log” % Long recording: small log Low runtime overhead Low cost Applicability: Programs – data race Systems – non-SC
4
3 Better and Better Recorders Low Cost Low Overhead Small Log Size Applicability Data RaceNon-SC InstRply ’87 YY B. & G. ’91 Y Netzer’93 Y Déjà Vu ’98 Y RecPlay ’00 JaRec ’04 FDR ’03 BugNet ’05 Y ReEnact ‘03 Y CORD ‘06 Y Strata ’06 Y RTR ‘06 YY
5
1 Byte/Kilo- Instruction [ASPLOS’06] A New Recorder Hardware Acceleration [ISCA’03] Less Hardware [ASPLOS’06] SC & TSO [ASPLOS’06] Result: One more step toward practical This talk covers only RTR Regulated Transitive Reduction algorithm
6
5 Outline Race Recording RTR Algorithm Results with Commercial Workloads Compress log during recording replay more “regularly” Conclusion
7
Technically, what’s race recording?
8
7 Race Recording X=6 X = 1 X++ print(X) X = 1 X++ print(X) - X = X*5 - X = X*5 - Thread I Thread J OriginalReplay X=10 Recording X= 6 - X = X*5 - Log Thread I Thread J
9
8 Goal: Reproduce same conflicts with minimum log data Terminologies and Assumptions ld A Thread I Thread J Recording st B st C sub ld B add st C ld B st A st C Thread I Thread J Replay Log ld D st D ld A st B st C sub ld B add st C ld B st A st C ld D st D Conflicts (red) Dependence (black)
10
Regulated Transitive Reduction (RTR)
11
10 Log All Conflicts 1 2 3 4 5 6 1 2 3 4 5 6 ld A Thread I Thread J Replay st B st C sub ld B add st C ld B st A st C ld D st D Log J : 2 3 1 4 3 5 4 6 Log I : 2 3 Log Size: 5*16=80 bytes (10 integers) Dependence Log 16 bytes But too many conflicts
12
11 Netzer’s Transitive Reduction (TR) 1 2 3 4 5 6 1 2 3 4 5 6 ld A Thread I Thread J Replay st B st C sub ld B add st C ld B st A st C ld D st D TR reduced Log J : 2 3 3 5 4 6 Log I : 2 3 Log Size: 64 bytes (8 integers) TR Reduced Log How to further reduce log size?
13
12 The Intuition of the RTR Algorithm After Reduction From I to J From J to I Vectors “Regulate” Replay
14
13 Stricter Dependences to Aid Vectorization 1 2 3 4 1 2 3 4 ld A Thread I Thread J Replay st B st C add st C ld B st A ld D 55 subst C 66 ld B st D Log J : 2 3 4 5 Log I : 2 3 Log Size: 48 bytes (6 integers) New Reduced Log stricter Reduced Fewer dependencies to log
15
14 Compress Vectorized Dependencies 1 2 3 4 5 6 1 2 3 4 5 6 ld A Thread I Thread J Replay st B st C sub ld B add st C ld B st A st C ld D st D Log J : x=3,5, ∆=1 Log I : x=3, ∆=1 Log Size: 40 bytes (5 integers) Vectorized Log Vector Deps. TR RTR: fewer deps + fewer byte/dep
16
15 Deadlock Avoidance of RTR 1 2 3 4 5 6 1 2 3 4 5 6 ld A Thread I Thread J Recording st B st C sub ld B add st C ld B st A st C ld D st D Limit the strict dependencies (see paper) i:4 j:1 j:2 i:3 i:4 Replay Cycle
17
Results with Commercial Workloads
18
17 Full-system Simulation Method Commercial server hardware GEMS: http://www.cs.wisc.edu/gems Full-system (OS + application) executions 4-core CMP (Sequential Consistent) 1-way in-order issue, 2 GHz, 64KB I/D L1, 4MB L2, 64byte lines, MOSI directory Commercial server software Apache – static web serving SpecJBB – middleware OLTP – TPC-C like Zeus – static web serving L2 C0C0 C1C1 C3C3 C2C2
19
18 Log Size: 1 byte/KI 0.0 0.5 1.0 1.5 2.0 byte/core/KI ApacheJBBOLTPZeusAVG 0 50 100 150 200 KB/core/s ApacheJBBOLTPZeusAVG Less buffer, longer recording, smaller logs
20
19 RTR vs. Netzer’s TR TR RTR 0 20 40 60 80 100 ApacheJBBOLTPZeusAVG Log Size 28% smaller log TR was “optimal”
21
20 Why Does RTR Work Well? RTR Instructions execute at similar speed Dependencies are often “vectorizable”
22
A New Recorder Hardware Acceleration [ISCA’03] 1 Byte/Kilo- Instruction [ASPLOS’06] Less Hardware [ASPLOS’06] SC & TSO [ASPLOS’06] Result: One more step toward practical “Less hardware” & “TSO” not covered Equally important More details in the paper
23
22 Conclusion Race recording Counter nondeterminism RTR 1 byte/kilo-instruction Based on Netzer’s transitive reduction Create stricter dependencies Vectorize dependencies to compress log Avoid overly-strict hence no deadlock Future work Support snooping, SMT, replayer
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.