Published by Shayna Toler. Modified over 9 years ago.
1
Gwendolyn Voskuilen, Faraz Ahmad, and T. N. Vijaykumar Electrical & Computer Engineering ISCA 2010
2
Shared-memory programs are hard to debug
  Due to non-deterministic memory races
  Memory races depend on thread interleaving
    ▪ Read/write by thread A + write by thread B to same location
Deterministic replay
  Checkpoint initial program state at recording start
  Record races in a log
  Enforce same race ordering at replay
Race recording provides repeatability
3
Record predecessor-successor ordering of threads involved in a memory race
Races always involve a write, so leverage coherence
  ▪ Global event (e.g., write invalidation) for memory races
Captures all races: synchronization and data
Two key overheads
  Log size
  Hardware to track race ordering
4
Centralized - Strata [ASPLOS06], DeLorean [ISCA08]
  Logging/ordering at a central entity
  DeLorean has shorter log but Strata uses less hardware
  Both less scalable
Distributed - FDR [ISCA03], RTR [ASPLOS06], Rerun [ISCA08]
  Use Lamport clocks with directory coherence
  All exploit transitivity to reduce logs
    ▪ Avoid recording races made redundant by transitivity
  Rerun significantly reduces hardware
Our focus: distributed schemes
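A minimal sketch of how transitivity can shorten a log, assuming scalar Lamport clocks and a replay that follows clock order. The names `Thread` and `race` are hypothetical and this is not the hardware of FDR, RTR, or Rerun: a race is logged only when the successor's clock does not already place it after the predecessor.

```python
class Thread:
    """Hypothetical software model of one thread with a scalar Lamport clock."""
    def __init__(self, name):
        self.name = name
        self.clock = 0


def race(pred, succ, log):
    """Order succ after pred; log the race only if it is not already implied."""
    if succ.clock > pred.clock:
        return False  # clock order already places succ after pred
    log.append((pred.name, pred.clock, succ.name))
    succ.clock = pred.clock + 1
    return True


log = []
a, b, c = Thread("A"), Thread("B"), Thread("C")
race(a, b, log)  # A -> B is logged
race(b, c, log)  # B -> C is logged
race(a, c, log)  # A -> C is implied by A -> B -> C, so it is skipped
```

In this toy run only two of the three races reach the log; the third is redundant because the earlier entries already force C after A.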
5
Goal: further reduce log size with minimal hardware
  Rerun logs 38 GB/hour on 16 2-GHz cores
Our key novelty: exploit acyclicity of races
  Previous schemes record all non-transitive races
  Timetraveler records only cyclic, non-transitive races
6
Two novel and elegant mechanisms
  Post-dating: correctly orders acyclic races and detects cyclic races via L1 & L2
    ▪ No messy cycle detection hardware (just a 32-bit timestamp/core)
  Time-delay buffers: avoid false cycles through L2
Reduce log by 8x (commercial) & 123x (scientific) over Rerun
Minimal hardware: 2 32-bit timestamps/core + 696-byte time-delay buffer
Logs 696 MB/hour on 16 2-GHz cores
Timetraveler significantly reduces log with minimal, elegant hardware
7
Introduction
Timetraveler operations
  Rerun background
  Post-dating
  Time-delay buffer
Results
Conclusion
8
Rerun eliminates per-block timestamps in L1 and L2
  Needs only one timestamp per core/L2 bank
Rerun divides thread into atomic sections (episodes)
  Ends episode at a race; successor's timestamp = predecessor's timestamp + 1 (piggybacked on coherence message)
  Logs length and timestamp of episode
  In replay, the serial order of episodes is known
Races fall in two categories [Strata]:
  Current: block last accessed in another thread's current episode
  Past: block last accessed in a past episode
  Distinguished by R/W bit per block (or Bloom filter)
Past races are implied by transitivity, need not be logged
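The episode mechanism above can be sketched in software as follows. This is an illustrative model under stated assumptions, not Rerun's hardware: the `Core` class and `current_race` function are hypothetical names, and the `max()` that keeps a clock from moving backwards is an added Lamport-style assumption (the slide only states predecessor timestamp + 1).

```python
class Core:
    """Hypothetical model of one core's Rerun-style recorder state."""
    def __init__(self, timestamp=0):
        self.timestamp = timestamp
        self.length = 0   # operations executed in the current episode
        self.log = []     # one (timestamp, length) entry per finished episode

    def execute(self, ops=1):
        self.length += ops

    def end_episode(self):
        """Log the current episode and return its timestamp."""
        ended = self.timestamp
        self.log.append((ended, self.length))
        self.length = 0
        self.timestamp = ended + 1
        return ended


def current_race(pred, succ):
    """A current race: the predecessor ends its episode, the successor follows it."""
    pred_ts = pred.end_episode()
    # Assumption: take the max so the successor's clock never decreases.
    succ.timestamp = max(succ.timestamp, pred_ts + 1)


a, b = Core(23), Core(20)
a.execute(5)
current_race(a, b)  # a logs (23, 5); b is now ordered after episode 23
```

Every current race costs the predecessor one log entry here, which is the overhead that Timetraveler targets.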
9
[Figure: Rerun example. Dynamic execution of two threads accessing blocks A and B, annotated with per-thread timestamps; result: 2 episodes, 2 log entries.]
10
Timetraveler logs only current, cyclic races
  Rerun logs all current races
Post-dating
  Upon a current race, the predecessor gives a post-dated timestamp to the successor and guarantees not to exceed it due to future races
    ▪ Without ending its chapter
    ▪ Breathing room for predecessor to avoid ending immediately
    ▪ Correctly orders acyclic successor
    ▪ Detects cycles causing post-dated timestamp to be exceeded
Minimal hardware over Rerun
Post-dating exploits acyclicity & detects cycles with minimal hardware
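A rough software sketch of post-dating, assuming the offset of 10 quoted in the evaluation setup. The `Core` class, its method names, and the choice to create the post-dated timestamp lazily on first hand-out are all illustrative assumptions, not details confirmed by the slides.

```python
POSTDATING_OFFSET = 10  # offset quoted in the slides; everything else is assumed


class Core:
    """Hypothetical model of per-core post-dating state."""
    def __init__(self, ts):
        self.ts = ts      # current timestamp
        self.pts = None   # post-dated timestamp, set once it is handed out
        self.log = []     # timestamps of chapters ended by a detected cycle

    def as_predecessor(self):
        """Hand out a post-dated timestamp without ending the chapter."""
        if self.pts is None:
            self.pts = self.ts + POSTDATING_OFFSET
        return self.pts

    def as_successor(self, received_pts):
        """Order this core after the predecessor; return True on a cycle."""
        new_ts = max(self.ts, received_pts + 1)
        cycle = self.pts is not None and new_ts > self.pts
        if cycle:
            self.log.append(self.ts)  # promise would be broken: end and log chapter
            self.pts = None
        self.ts = new_ts
        return cycle


a, b = Core(23), Core(20)
first = b.as_successor(a.as_predecessor())   # acyclic race: nothing is logged
second = a.as_successor(b.as_predecessor())  # cyclic race: a must end its chapter
```

In this run the first race is ordered for free (a promises 33, b moves to 34), and only the second race, which would push a past its promised 33, produces a log entry.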
11
[Figure: Rerun vs. Timetraveler on the same dynamic execution (accesses to blocks A and B). Rerun: per-thread timestamps, 2 episodes. Timetraveler: current and post-dated timestamps, 1 chapter.]
12
Rerun conservatively ends episodes upon replacements/downgrades of current blocks to L2
  Places timestamp at L2 for successors
  Orders racing successor after predecessor
Timetraveler employs post-dating to avoid ending
  Places post-dated timestamp at L2
Post-dating extends chapters beyond replacements
13
Problem: only one timestamp per L2 bank
  All blocks look recent, even if only a single block was recently accessed and the others were accessed long ago
  Causes false cycles when accessing one of the others
    ▪ L2 timestamp > thread's post-dated timestamp → cycle
Solution: buffer most-recently arrived timestamps at L2
  Delays update of L2 timestamp so L2 bank retains old timestamp
  L2 timestamp < thread's post-dated timestamp → no cycle
  Requests get data from L2, timestamp from buffer or L2
  8 entries per L2 bank suffice
Time-delay buffer avoids false cycles through L2
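One way to picture the time-delay buffer is as a small FIFO in front of the bank timestamp. The sketch below is an assumption-laden model: the `L2Bank` class, FIFO eviction, and merging an evicted timestamp with `max()` are illustrative choices; only the 8-entry size comes from the slides.

```python
from collections import OrderedDict


class L2Bank:
    """Hypothetical model: one bank timestamp plus a small time-delay buffer."""
    def __init__(self, entries=8):
        self.timestamp = 0
        self.entries = entries
        self.buffer = OrderedDict()  # block address -> recently arrived timestamp

    def place(self, block, ts):
        """Record a timestamp that arrives with a replaced or downgraded block."""
        self.buffer.pop(block, None)
        self.buffer[block] = ts
        if len(self.buffer) > self.entries:
            _, old_ts = self.buffer.popitem(last=False)   # evict oldest arrival
            self.timestamp = max(self.timestamp, old_ts)  # delayed bank update

    def timestamp_for(self, block):
        """Timestamp returned alongside the data for a request to `block`."""
        return self.buffer.get(block, self.timestamp)


bank = L2Bank()
bank.timestamp = 5
bank.place("X", 100)                 # only X was touched recently
old_block = bank.timestamp_for("Y")  # Y still sees the old bank timestamp
recent_block = bank.timestamp_for("X")
```

A request for the long-idle block Y gets the old timestamp rather than X's recent one, so it does not appear to exceed the requester's post-dated timestamp.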
14
Introduction
Timetraveler operations
  Rerun background
  Post-dating
  Time-delay buffer
Results
Conclusion
15
GEMS + Simics
  8 in-order cores, MESI coherence
  32 KB split I & D, 8 MB 8-bank L2
Workloads
  Commercial: Apache, OLTP, SpecJBB 2005
  Scientific: SPLASH Ocean, Raytrace, Water-nsquared
Timetraveler
  R/W bits per L1 block, 8-entry time-delay buffer per L2 bank, 32-bit timestamps, 16-bit chapter length, post-dating offset = 10
Rerun
  R/W Bloom filters, 32-bit timestamps, 16-bit episode length
16
[Figure: log reduction over Rerun, labeled 8x and 123x]
Large reduction in log growth due to post-dating
Post-dating & time-delay buffer effectively capture true cycles
17
Benchmarks   Current   Current-block replacements   Total current-races
             races     Current-races   Non-races    per chapter
Specjbb      0.6       1.1             21.0         1.7
Apache       1.5       8.0             26.1         9.5
OLTP         3.4       5.8             12.2         9.3
Water-n²     2.3       6.4             228.2        8.7
Ocean        1.8       2.4             5.1          4.1
Raytrace     2.4       3.9             197.8        6.3
Mean-com     1.8       4.9             19.8         6.8
Mean-sci     2.1       4.2             143.7        6.4

Multiple races per chapter
Ending on current-block replacements would significantly shorten chapters
18
Timetraveler exploits acyclicity of races to reduce log size
  8x (commercial) & 123x (scientific) reduction over Rerun
Two novel techniques elegantly exploit acyclicity and detect cycles
  Post-dating
  Time-delay buffer
Introduces minimal hardware
  Two timestamps per core
  696-byte time-delay buffer
CMPs on the rise + debugging important → Timetraveler valuable
19
Gwendolyn Voskuilen, Faraz Ahmad, and T. N. Vijaykumar Electrical & Computer Engineering ISCA 2010
20
Two requirements for replay
  All original races must occur in replay
  No new races (not seen originally) may occur
Replay need not be terribly fast but cannot be terribly slow
  Thus simplest scheme is sequential replay of chapters
  Can leverage speculation for faster replay
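Sequential replay of chapters could look like the sketch below, assuming each thread's log holds (timestamp, length) entries and that chapters with equal timestamps do not race with each other. The function name and the log contents are hypothetical.

```python
def sequential_replay_order(logs):
    """Merge per-thread logs of (timestamp, length) into one serial order."""
    chapters = [
        (ts, thread, length)
        for thread, entries in logs.items()
        for ts, length in entries
    ]
    return sorted(chapters)  # ascending timestamp; thread id breaks ties


# Hypothetical log contents for two threads.
logs = {"T0": [(23, 120), (45, 40)], "T1": [(34, 75)]}
order = sequential_replay_order(logs)
```

A replayer would then run each chapter to completion, for its recorded length, in this order; that reproduces every logged ordering and, because only one chapter runs at a time, introduces no new races.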