1
Improving the Speed and Quality of Architectural Performance Evaluation Vijay S. Pai with contributions from: Derek Schuff, Milind Kulkarni Electrical and Computer Engineering Purdue University
2
Outline Intro to Reuse Distance Analysis ▫Contributions Multicore-Aware Reuse Distance Analysis ▫Design ▫Results Sampled Parallel Reuse Distance Analysis ▫Design: Sampling, Parallelism ▫Results ▫Application: selection of low-locality code
3
Reuse Distance Analysis Reuse Distance Analysis (RDA): architecture-neutral locality profile ▫Number of distinct data elements referenced between use and reuse of a data element ▫Elements can be memory pages, disk blocks, cache blocks, etc. Machine-independent model of locality ▫Predicts hit ratio in any size fully-associative LRU cache ▫Hit ratio in cache with X blocks = % of references with RD < X
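The prediction rule above can be sketched in a few lines: given a list of measured reuse distances, the predicted hit ratio of a fully-associative LRU cache with X blocks is simply the fraction of references whose distance is below X (a minimal illustration; the function name is invented here).

```python
def predicted_hit_ratio(distances, cache_blocks):
    # A fully-associative LRU cache with X blocks hits exactly when the
    # reuse distance is less than X; infinite distances are always misses.
    hits = sum(1 for d in distances if d < cache_blocks)
    return hits / len(distances)
```

Because the profile is machine-independent, the same distance list predicts hit ratios for every cache size at once.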
4
Reuse Distance Analysis Applications in performance modeling and optimization ▫Multiprogramming/scheduling interaction, phase prediction ▫Cache hint generation, restructuring code, data layout
5
Reuse Distance Profile Example
6
Reuse Distance Measurement Maintain a stack of all previously referenced addresses For each reference: ▫Search the stack for the referenced address ▫Depth in the stack = reuse distance; if not found, distance = ∞ ▫Remove from the stack (if present) and push on top
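The steps above can be sketched as the direct list-based O(NM) implementation, assuming a trace of hashable addresses (names are illustrative):

```python
def reuse_distances(trace):
    # Stack of previously seen addresses, most recent first.
    stack = []
    out = []
    for addr in trace:
        if addr in stack:
            # Depth in the stack = number of distinct addresses
            # referenced since the last use of addr.
            d = stack.index(addr)
            stack.remove(addr)
        else:
            d = float('inf')      # first use: infinite distance
        stack.insert(0, addr)     # push (or move) to the top
        out.append(d)
    return out
```

Each reference scans the stack, which is why the list-based algorithm is O(NM) for N references over M distinct addresses.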
7
Example: trace A B C C B A

Reference: A  B  C  C  B  A
Distance:  ∞  ∞  ∞  0  1  2
8
RDA Applications VM page locality [Mattson 1970] Cache performance prediction [Beyls01, Zhong03] Cache hinting [Beyls05] Code restructuring [Beyls06], data layout [Zhong04] Application performance modeling [Marin04] Phase prediction [Shen04] Visualization, manual optimization [Beyls04, 05, Marin08] Modeling cache contention (multiprogramming) [Chandra05, Suh01, Fedorova05, Kim04]
9
Measurement Methods List-based stack algorithm is O(NM) Balanced binary trees or splay trees: O(N log M) ▫[Olken81, Sugumar93] Approximate analysis (tree compression): O(N log log M) time and O(log M) space [Ding03]
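To illustrate how the O(N log M)-class methods avoid scanning a stack, here is a common equivalent formulation (not the Olken tree itself, and names are invented): record each address's last-use timestamp and count still-active timestamps after it with a Fenwick (binary indexed) tree, giving O(log N) per reference.

```python
def reuse_distances_fast(trace):
    # Fenwick tree over timestamps: bit position t holds 1 if the address
    # last referenced at time t has not been touched since.
    n = len(trace)
    bit = [0] * (n + 1)

    def update(i, v):
        while i <= n:
            bit[i] += v
            i += i & -i

    def prefix(i):
        s = 0
        while i > 0:
            s += bit[i]
            i -= i & -i
        return s

    last = {}   # address -> 1-based timestamp of its last use
    out = []
    for t, addr in enumerate(trace, 1):
        if addr in last:
            lt = last[addr]
            # Active timestamps strictly after lt = distinct addresses
            # referenced since the last use of addr.
            out.append(prefix(n) - prefix(lt))
            update(lt, -1)        # addr is being re-referenced: clear old slot
        else:
            out.append(float('inf'))
        update(t, 1)              # mark the current reference as active
        last[addr] = t
    return out
```

The same trace as on the example slide yields the same distances as the list-based version, but without any linear stack scans.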
10
Contributions Multicore-Aware Reuse Distance Analysis ▫First RDA to include sharing and invalidation ▫Study of different invalidation timing strategies Acceleration of Multicore RDA ▫Sampling, parallelization ▫Demonstration of application: selection of low-locality code ▫Validation against full analysis, hardware Prefetching model in RDA ▫Hybrid analysis
11
Outline Intro to Reuse Distance Analysis ▫Contributions Multicore-Aware Reuse Distance Analysis ▫Design ▫Results Sampled Parallel Reuse Distance Analysis ▫Design: Sampling, Parallelism ▫Results ▫Application: selection of low-locality code
12
Extending RDA to Multicore RDA defined for a single reference stream ▫No prior work accounts for multithreading Multicore-aware RDA accounts for invalidations and data sharing ▫Models locality of multi-threaded programs ▫Targets multicore processors with private or shared caches
13
Multicore Reuse Distance Invalidations cause additional misses in private caches ▫2nd-order effect: holes can be filled without eviction Sharing affects locality in shared caches ▫Inter-thread data reuse (reduces distance to shared data) ▫Capacity contention (increases distance to unshared data)
14
Invalidations: same trace A B C C B A, with a remote write to A before its reuse

Reference:           A  B  C  C  B  A
Distance (unaware):  ∞  ∞  ∞  0  1  2
Distance (aware):    ∞  ∞  ∞  0  1  ∞   (remote write invalidates A, leaving a hole)
15
Invalidation Timing Multithreaded interleaving is nondeterministic ▫If there are no races, invalidations can be propagated any time between the write and the next synchronization Eager invalidation: immediately at the write Lazy invalidation: at the next synchronization ▫Could increase reuse distance Oracular invalidation: at the previous synchronization ▫Data-race-free (DRF) → data will not be referenced by the invalidated thread ▫Could decrease reuse distance
16
Sharing: same trace A B C C B A in a shared stack, with a remote write to A merged into the reference stream

Reference:           A  B  C  C  B  A
Distance (unaware):  ∞  ∞  ∞  0  1  2
Distance (aware):    ∞  ∞  ∞  0  2  1   (the remote reference reuses A at distance 2 and moves it to the top, so B then reuses at 2 and the final A at 1)
17
MCRD Results
18
(figure)
19
Impact of Inaccuracy
20
(figure)
21
Summary So Far Compared unaware and multicore-aware RDA to simulated caches ▫Private caches: unaware 37% error, aware 2.5% ▫Invalidation timing had minor effect on accuracy ▫Shared caches: unaware 76+% error, aware 4.6% Made RDA viable for multithreaded workloads
22
Problems with Multicore RDA RDA is slow in general ▫Even efficient implementations require O(log M) time per reference Multithreading makes it worse ▫Serialization ▫Synchronization (expensive bus-locked operations on every program reference) Goal: fast enough for programmers to use in the development cycle
23
Accelerating Multicore RDA Sampling Parallelization
24
Reuse Distance Sampling Randomly select individual references ▫Skip count before each sampled reference drawn from a geometric distribution; expect 1 in n references sampled (n = 1,000,000) ▫Fast mode until the target reference is reached
25
Reuse Distance Sampling Monitor all references until the sampled address is reused (analysis mode) ▫Track unique addresses in a distance set ▫RD of the reuse reference is the size of the distance set Return to fast mode until the next sample
26
Reuse Distance Sampling Analysis mode is faster than full RDA ▫Full stack tracking not needed ▫Distance set implemented as a hash table
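The fast-mode/analysis-mode loop of the last three slides can be sketched as follows (a single-threaded simplification with invented names; geometric skips are drawn by flipping a coin per reference, and fast mode resumes right after the sampled reference):

```python
import random

def sampled_reuse_distances(trace, p=0.001, rng=None):
    # p is the per-reference sampling probability (~1/n expected rate).
    rng = rng or random.Random(42)
    results = []
    i, n = 0, len(trace)
    while i < n:
        # Fast mode: skip a geometrically distributed number of references.
        while i < n and rng.random() >= p:
            i += 1
        if i >= n:
            break
        target = trace[i]
        # Analysis mode: track distinct addresses in a hash set (no stack).
        distance_set = set()
        j = i + 1
        while j < n and trace[j] != target:
            distance_set.add(trace[j])
            j += 1
        # RD of the reuse = size of the distance set; never reused = ∞.
        results.append(len(distance_set) if j < n else float('inf'))
        i += 1
    return results
```

Setting p = 1.0 samples every reference, which is a convenient way to exercise the analysis-mode logic on a tiny trace.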
27
RD Sampling of MT Programs Two new concerns: data sharing and invalidation ▫Invalidation of the tracked address ▫Invalidation of an address in the distance set
28
RD Sampling of MT Programs Data sharing ▫Analysis mode sees references from all threads ▫The reuse reference can occur on any thread
29
RD Sampling of MT Programs Invalidation of the tracked address ▫Distance = ∞
30
RD Sampling of MT Programs Invalidation of an address in the distance set ▫Remove it from the set, increment a hole count ▫New addresses “fill” holes (decrement the count) At reuse, RD = set size + hole count
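The hole-counting rule can be sketched as an event loop over one analysis-mode window (a hypothetical encoding: each event is ('ref', addr) or ('inval', addr)):

```python
def analysis_mode(events, target):
    # Track the distance set for one sampled reference to `target`.
    dset, holes = set(), 0
    for kind, addr in events:
        if kind == 'ref':
            if addr == target:
                # Reuse: RD = set size + hole count.
                return len(dset) + holes
            if addr not in dset:
                dset.add(addr)
                if holes:
                    holes -= 1      # a new address "fills" a hole
        elif kind == 'inval':
            if addr == target:
                return float('inf')  # tracked address invalidated
            if addr in dset:
                dset.remove(addr)
                holes += 1           # leave a hole instead of shrinking RD
    return float('inf')              # never reused
```

Counting holes models the 2nd-order effect mentioned earlier: an invalidated block's slot can be refilled without an eviction, so the invalidation alone should not shrink the measured distance.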
31
Parallel Measurement Goals: get parallelism in the analysis, eliminate per-reference synchronization Two properties facilitate this ▫Sampled analysis only tracks the distance set, not the whole stack (allows separation of state) ▫Exact timing of invalidations is not significant (allows delayed synchronization)
32
Parallel Measurement Data sharing ▫Each thread has its own distance set ▫All sets are merged on reuse At reuse, RD = merged set size
33
Parallel Measurement Invalidations ▫Other threads record write sets ▫On synchronization, write-set contents are invalidated from the distance set
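The merge step of these two slides can be sketched as a pure function over per-thread state (a simplified sketch with invented names; hole accounting from the sampled analysis is omitted for brevity):

```python
def merged_reuse_distance(thread_sets, thread_write_sets, target):
    # Each thread tracked its own distance set with no synchronization;
    # remote threads also logged their write sets.
    if any(target in ws for ws in thread_write_sets):
        return float('inf')                    # tracked address invalidated
    merged = set().union(*thread_sets)         # data sharing: union the sets
    written = set().union(*thread_write_sets)
    merged -= written                          # invalidate remotely written addrs
    return len(merged)
```

Because the exact timing of invalidations is not significant, applying the write sets only at the merge point is safe, which is what removes the per-reference synchronization.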
34
Pruning Analysis mode stays active until reuse ▫What if the address is never reused? ▫Program locality determines time spent in analysis mode Periodically prune (remove & record) the oldest sample ▫If its distance is large enough, e.g. in the top 1% of distances seen so far ▫A size-relative threshold allows different input sizes
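The pruning test can be sketched as a size-relative threshold check (a hedged sketch; the quantile and function name are illustrative, not the exact rule from the implementation):

```python
def should_prune(current_set_size, recorded_distances, quantile=0.99):
    # Prune the oldest active sample once its distance set is already
    # larger than, e.g., the top 1% of finite distances recorded so far.
    finite = sorted(d for d in recorded_distances if d != float('inf'))
    if not finite:
        return False          # nothing recorded yet: keep waiting
    threshold = finite[int(quantile * (len(finite) - 1))]
    return current_set_size > threshold
```

Because the threshold scales with the distances actually observed, the same rule works across input sizes without retuning.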
35
Results Comparison with full analysis ▫Histograms ▫Accuracy metric Performance ▫Slowdown from native
36
Example RD Histograms (figure; x-axis: reuse distance in 64-byte blocks)
37
Example RD Histograms (figure; x-axis: reuse distance in 64-byte blocks) Slowdown of the full analysis perturbs the execution of spin-locks, inflating the 0-distance bin in the histogram
38
Example RD Histograms (figure; x-axis: reuse distance in 64-byte blocks)
39
Results: Private Stacks Error metric used by previous work: ▫Normalize histogram bins ▫Error E = Σᵢ |fᵢ − sᵢ| ▫Accuracy = 1 − E/2 91%-99% accuracy (avg 95.6%) 177x faster than full analysis 7.1x-143x slowdown from native (avg 29.6x) ▫Fast mode: 5.3x ▫80.4% of references in fast mode
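The accuracy metric above is direct to compute: normalize both histograms, sum the absolute bin differences, and map the result into [0, 1] (a minimal sketch; histograms are assumed to be equal-length lists of raw counts).

```python
def histogram_accuracy(full, sampled):
    # Accuracy = 1 - E/2, with E = sum_i |f_i - s_i| over normalized bins.
    f_tot, s_tot = sum(full), sum(sampled)
    e = sum(abs(f / f_tot - s / s_tot) for f, s in zip(full, sampled))
    return 1 - e / 2
```

Identical shapes give accuracy 1.0; completely disjoint histograms give 0.0, which is why E is divided by 2.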
40
Results: Shared Stacks Shared reuse distances depend on all references by other threads ▫Not just those to shared data ▫Relative execution rate matters ▫More variation in measurements and in real execution Compare fully-parallel sampled analysis mode to serialized sampled analysis mode ▫Round-robin ensures threads progress at the same rate as in non-sampled analysis
41
FT Histogram (figure; x-axis: reuse distance in 64-byte blocks)

                     Accuracy  Slowdown
Parallel sampling    74.1%     80x
Sequential sampling  88.9%     265x
42
Performance Comparison Single-thread sampling [Zhong08] ▫Instrumentation 2x-4x (compiler), 4x-10x (Valgrind) ▫Additional 10x-90x with analysis Approximate non-random sampling [Beyls04] ▫15x-25x (single-thread, compiler) Valgrind, our benchmarks ▫Instrumentation 4x-75x, avg 23x ▫Memcheck avg 97x
43
Low-locality PC Selection Application: find code with poor locality to assist programmer optimization ▫e.g. n PCs account for y% of misses at cache size C Select C such that the miss ratio is 10%, then find enough PCs to cover 75/80/90/95% of misses Use weight-matching to compare the selection against full analysis Selection accuracy 91%-92% for private and shared caches ▫In spite of reduced accuracy in the parallel-shared case
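The PC-selection step can be sketched as a greedy coverage pass over per-PC miss counts at the chosen cache size (a hedged illustration; the miss counts, dictionary shape, and function name are assumptions, not the paper's implementation):

```python
def select_low_locality_pcs(miss_counts, coverage=0.75):
    # miss_counts: {pc: misses at cache size C}. Take the smallest set of
    # PCs, highest miss count first, covering the target miss fraction.
    total = sum(miss_counts.values())
    picked, covered = [], 0
    for pc, misses in sorted(miss_counts.items(), key=lambda kv: -kv[1]):
        picked.append(pc)
        covered += misses
        if covered >= coverage * total:
            break
    return picked
```

Weight-matching then compares the set picked from the sampled profile against the set picked from the full analysis.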
44
Smarter Multithreaded Replacement Shared cache management is challenging ▫Benefits of demand multiplexing ▫Cost of performance interference Most prior work addresses multiprogramming ▫Destructive interference only ▫Per-benchmark performance targets Multithreading presents opportunities and challenges ▫Constructive interference, per-process performance target ▫Reuse distance profiles can help understand needs ▫Work in progress!
45
Conclusion Two techniques to accelerate multicore-aware reuse distance analysis ▫Sampled analysis ▫Parallel analysis ▫Private caches: 96% accuracy, 30x slowdown vs. native ▫Shared caches: 74%/89% accuracy, 80x/265x slowdown vs. native Demonstrated effectiveness for selecting code with low locality ▫91% weight-matched coverage of PCs Other applications in progress Validated against hardware caches ▫7-16% average error in miss prediction
46
Questions?