RTR: 1 Byte/Kilo-Instruction Race Recording Min Xu Rastislav BodikMark D. Hill.

RTR: 1 Byte/Kilo-Instruction Race Recording Min Xu Rastislav BodikMark D. Hill

1 % gcc sim.c % a.out Segmentation fault % % gdb a.out gdb> run Program received SIGSEGV. In get() at hash.c:45 45 a = bucket->d; % gdb a.out gdb> run Program exited normally. gdb> % gcc para-sim.c % a.out Segmentation fault % Why Do You Need a Recorder? % gdb a.out log gdb> run Program received SIGSEGV. In get() at para-hash.c:67 67 a = bucket->d; % gcc para-sim.c % a.out Segmentation fault Race recorded in “log” %

2 Ideally … % gdb a.out log gdb> run Program received SIGSEGV. In get() at para-hash.c:67 67 a = bucket->d; % gcc para-sim.c % a.out Segmentation fault Race recorded in “log” % Long recording: small log Low runtime overhead Low cost Applicability: Programs – data race Systems – non-SC

3 Better and Better Recorders Low Cost Low Overhead Small Log Size Applicability Data RaceNon-SC InstRply ’87  YY B. & G. ’91  Y Netzer’93  Y Déjà Vu ’98  Y RecPlay ’00 JaRec ’04  FDR ’03 BugNet ’05  Y ReEnact ‘03  Y CORD ‘06  Y Strata ’06  Y RTR ‘06  YY

1 Byte/Kilo- Instruction [ASPLOS’06] A New Recorder Hardware Acceleration [ISCA’03] Less Hardware [ASPLOS’06] SC & TSO [ASPLOS’06] Result: One more step toward practical This talk covers only RTR Regulated Transitive Reduction algorithm

5 Outline Race Recording RTR Algorithm Results with Commercial Workloads Compress log during recording  replay more “regularly” Conclusion

Technically, what’s race recording?

7 Race Recording X=6 X = 1 X++ print(X) X = 1 X++ print(X) - X = X*5 - X = X*5 - Thread I Thread J OriginalReplay X=10 Recording X= 6 - X = X*5 - Log Thread I Thread J

8 Goal: Reproduce same conflicts with minimum log data Terminologies and Assumptions ld A Thread I Thread J Recording st B st C sub ld B add st C ld B st A st C Thread I Thread J Replay Log ld D st D ld A st B st C sub ld B add st C ld B st A st C ld D st D Conflicts (red) Dependence (black)

Regulated Transitive Reduction (RTR)

10 Log All Conflicts 1 2 3 4 5 6 1 2 3 4 5 6 ld A Thread I Thread J Replay st B st C sub ld B add st C ld B st A st C ld D st D Log J : 2  3 1  4 3  5 4  6 Log I : 2  3 Log Size: 5*16=80 bytes (10 integers) Dependence Log 16 bytes But too many conflicts

11 Netzer’s Transitive Reduction (TR) 1 2 3 4 5 6 1 2 3 4 5 6 ld A Thread I Thread J Replay st B st C sub ld B add st C ld B st A st C ld D st D TR reduced Log J : 2  3 3  5 4  6 Log I : 2  3 Log Size: 64 bytes (8 integers) TR Reduced Log How to further reduce log size?

12 The Intuition of the RTR Algorithm After Reduction From I to J From J to I Vectors “Regulate” Replay

13 Stricter Dependences to Aid Vectorization 1 2 3 4 1 2 3 4 ld A Thread I Thread J Replay st B st C add st C ld B st A ld D 55 subst C 66 ld B st D Log J : 2  3 4  5 Log I : 2  3 Log Size: 48 bytes (6 integers) New Reduced Log stricter Reduced Fewer dependencies to log

14 Compress Vectorized Dependencies 1 2 3 4 5 6 1 2 3 4 5 6 ld A Thread I Thread J Replay st B st C sub ld B add st C ld B st A st C ld D st D Log J : x=3,5, ∆=1 Log I : x=3, ∆=1 Log Size: 40 bytes (5 integers) Vectorized Log Vector Deps. TR  RTR: fewer deps + fewer byte/dep

15 Deadlock Avoidance of RTR 1 2 3 4 5 6 1 2 3 4 5 6 ld A Thread I Thread J Recording st B st C sub ld B add st C ld B st A st C ld D st D Limit the strict dependencies (see paper) i:4  j:1  j:2  i:3  i:4 Replay Cycle

Results with Commercial Workloads

17 Full-system Simulation Method Commercial server hardware GEMS: http://www.cs.wisc.edu/gems Full-system (OS + application) executions 4-core CMP (Sequential Consistent) 1-way in-order issue, 2 GHz, 64KB I/D L1, 4MB L2, 64byte lines, MOSI directory Commercial server software Apache – static web serving SpecJBB – middleware OLTP – TPC-C like Zeus – static web serving L2 C0C0 C1C1 C3C3 C2C2

18 Log Size: 1 byte/KI 0.0 0.5 1.0 1.5 2.0 byte/core/KI ApacheJBBOLTPZeusAVG 0 50 100 150 200 KB/core/s ApacheJBBOLTPZeusAVG Less buffer, longer recording, smaller logs

19 RTR vs. Netzer’s TR TR RTR 0 20 40 60 80 100 ApacheJBBOLTPZeusAVG Log Size 28% smaller log TR was “optimal”

20 Why Does RTR Work Well? RTR Instructions execute at similar speed Dependencies are often “vectorizable”

A New Recorder Hardware Acceleration [ISCA’03] 1 Byte/Kilo- Instruction [ASPLOS’06] Less Hardware [ASPLOS’06] SC & TSO [ASPLOS’06] Result: One more step toward practical “Less hardware” & “TSO” not covered Equally important More details in the paper

22 Conclusion Race recording  Counter nondeterminism RTR  1 byte/kilo-instruction Based on Netzer’s transitive reduction Create stricter dependencies Vectorize dependencies to compress log Avoid overly-strict hence no deadlock Future work Support snooping, SMT, replayer

RTR: 1 Byte/Kilo-Instruction Race Recording Min Xu Rastislav BodikMark D. Hill.

Similar presentations

Presentation on theme: "RTR: 1 Byte/Kilo-Instruction Race Recording Min Xu Rastislav BodikMark D. Hill."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

RTR: 1 Byte/Kilo-Instruction Race Recording Min Xu Rastislav BodikMark D. Hill.

Similar presentations

Presentation on theme: "RTR: 1 Byte/Kilo-Instruction Race Recording Min Xu Rastislav BodikMark D. Hill."— Presentation transcript:

Similar presentations

About project

Feedback