Decomposing Memory Performance: Data Structures and Phases
Kartik K. Agaram, Stephen W. Keckler, Calvin Lin, Kathryn McKinley
Department of Computer Sciences, The University of Texas at Austin
2 Memory hierarchy trends
Growing latency to main memory
Growing cache complexity
–More cache levels
–New mechanisms, optimizations
Growing application complexity
–Lots of abstraction
Application-system interactions increasingly hard to predict
3 The solution: More fine-grained metrics
More insight within an application
More rigorous comparisons across applications
Potential applications:
–Hardware/software tuning
–Global hints for online phase detection
Our approach: data structure decomposition
–High-level, easy to understand
–Highlights important access patterns
4 ammp vs twolf: The tale of two applications
Conventional view: they're pretty similar
–IPC: 0.57 vs 0.51
–DL1 miss rate: 10% vs 9.5%
Access patterns:
–Lots of pointer access in both
–Mostly linked-list traversal
5 ammp vs twolf: Data structure decomposition
[Chart: breakdown of DL1 misses (%) by data structure for ammp and twolf]
6 ammp vs twolf: Access patterns
twolf:
  t1 = b[c[i]->cblock]
  t2 = t1->tile->term
  t3 = n[t2->net]
  ...
  i = rand()
ammp:
  atom = atom->next
  atom[i]->neighbour[j]
  ++j
  ++i
twolf has more complex access patterns
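To make the contrast concrete, here is a self-contained C sketch of the two styles. The struct layouts and the fixed neighbour-array size are assumptions for illustration; they follow the slide's snippets, not the actual SPEC sources.

    #include <stdlib.h>

    /* twolf-style access: each load depends on the pointer fetched by the
     * previous one, and the starting index is randomized, so consecutive
     * iterations touch unrelated cache lines. */
    struct term  { int net; };
    struct tile  { struct term *term; };
    struct block { struct tile *tile; };
    struct cell  { int cblock; };
    struct net   { int id; };

    void twolf_style(struct block **b, struct cell **c, struct net **n,
                     int n_cells, int iters)
    {
        for (int k = 0; k < iters; k++) {
            int i = rand() % n_cells;              /* randomized index    */
            struct block *t1 = b[c[i]->cblock];    /* dependent load 1    */
            struct term  *t2 = t1->tile->term;     /* dependent loads 2,3 */
            struct net   *t3 = n[t2->net];         /* dependent load 4    */
            (void)t3;
        }
    }

    /* ammp-style access: a plain linked-list walk plus a neighbour scan;
     * the traversal order is fixed and far more regular. */
    struct atom { struct atom *next; struct atom *neighbour[8]; };

    void ammp_style(struct atom *head, struct atom **atoms, int n_atoms)
    {
        for (struct atom *a = head; a; a = a->next)   /* list traversal   */
            ;
        for (int i = 0; i < n_atoms; ++i)
            for (int j = 0; j < 8; ++j)
                (void)atoms[i]->neighbour[j];         /* neighbour scan   */
    }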
7 ammp vs twolf: Phase behavior
[Time-series plots of DL1 misses for ammp and twolf over roughly 60 billion cycles]
8 ammp vs twolf: Phase behavior by data structure
[Per-data-structure time-series plots for ammp and twolf]
ammp has more interesting phase behavior
9 Outline
Motivation
Data structure decomposition
Phase analysis: selecting sampling period
Results:
–Aggregate
–Phase
10 Data structure decomposition
Application communicates with the simulator
Leave the core application oblivious; automatically add simulator-aware instrumentation
[Diagram: Application and Simulator communicating through shared resources]
11 DTrack
[Toolchain diagram: Application Sources → Source Translator → Instrumented Sources → Compiler → Application Executable → Simulator → Detailed Statistics]
DTrack's protocol for application-simulator communication
12 DTrack's protocol
1. Application stores a mapping at a predetermined shared location
   –(start address, end address) → variable name
2. ...and signals the simulator by a special opcode
   –Other techniques possible
3. Simulator detects the signal, reads the shared location
Simulator now knows the variable names of address regions
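A minimal sketch of what the application side of such a protocol could look like. The table dtrack_map, its entry layout, and dtrack_register() are hypothetical names, not DTrack's actual code, and the special opcode is left as a commented placeholder because its encoding is simulator-specific.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical layout of the published mapping:
     * (start address, end address) -> variable name.                  */
    typedef struct {
        uint64_t start;
        uint64_t end;
        char     name[32];
    } dtrack_entry;

    /* The "predetermined shared location" the simulator knows to read. */
    static dtrack_entry dtrack_map[1024];
    static int          dtrack_count;

    static void dtrack_register(const void *start, size_t bytes,
                                const char *name)
    {
        dtrack_entry *e = &dtrack_map[dtrack_count++];
        e->start = (uint64_t)(uintptr_t)start;
        e->end   = e->start + bytes - 1;
        strncpy(e->name, name, sizeof e->name - 1);
        e->name[sizeof e->name - 1] = '\0';

        /* Step 2 of the protocol: signal the simulator with a special
         * opcode (or another recognizable no-op); the encoding is
         * simulator-specific, so it is omitted here.                  */
    }

The source translator could then emit, for a hypothetical global foo, a call like dtrack_register(&foo, sizeof foo, "foo") during initialization, so the cost is paid once and amortized across all globals.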
13 Instrumentation without perturbation
Global segment: write to a file
–Expensive, but a one-time cost during initialization
–Amortized across all global variables
Heap: save in special variables after every malloc/free (sketch below)
–Overhead ∝ frequency of mallocs/frees
–Special variables always hit in the cache
Stack: no instrumentation
–Function calls too frequent
–Causes negligible misses anyway
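For the heap case, the instrumentation described above might look roughly like the following. The wrapper name and the "special variables" are illustrative, not DTrack's actual code; the point is that each allocation adds only a few stores to a small, hot record.

    #include <stdlib.h>
    #include <stdint.h>

    /* "Special variables": a tiny record rewritten on every allocation,
     * so the extra stores almost always hit in the DL1.               */
    static volatile uint64_t    dtrack_heap_start;
    static volatile uint64_t    dtrack_heap_end;
    static const char *volatile dtrack_heap_name;

    static void *dtrack_malloc(size_t bytes, const char *site_name)
    {
        void *p = malloc(bytes);
        if (p != NULL) {
            dtrack_heap_start = (uint64_t)(uintptr_t)p;
            dtrack_heap_end   = dtrack_heap_start + bytes - 1;
            dtrack_heap_name  = site_name;
            /* signal the simulator here, as in the protocol sketch above */
        }
        return p;
    }

The source translator would rewrite a call such as p = malloc(n * sizeof *p) into p = dtrack_malloc(n * sizeof *p, "p"), so the overhead scales with how often the program calls malloc and free.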
14 Measuring perturbation
Communicate specific start and end points in the application to the simulator
Compare instruction counts between them with and without instrumentation
Δ instruction count < 4% even with frequent mallocs
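One plausible way to express that measurement; dtrack_mark() and the example counts below are purely illustrative, not figures from the slide.

    #include <stdio.h>

    /* Hypothetical marker emitting the same simulator-visible signal as
     * above; the simulator counts committed instructions between marks. */
    void dtrack_mark(int region_start);

    /* Relative change in instruction count caused by instrumentation.   */
    double perturbation(double insns_instrumented, double insns_baseline)
    {
        return (insns_instrumented - insns_baseline) / insns_baseline;
    }

    int main(void)
    {
        /* Illustrative counts only: 10.3e9 vs 10.0e9 committed
         * instructions between the marks gives a 3% perturbation,
         * within the <4% bound reported on the slide.                   */
        printf("%.1f%%\n", 100.0 * perturbation(10.3e9, 10.0e9));
        return 0;
    }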
15 Outline
Motivation
Data structure decomposition
Phase analysis: selecting sampling period
Results:
–Aggregate
–Phase
16 The importance of sampling period
[Plots of the same miss stream at two sampling periods: DL1 misses per 10M cycles (thousands) vs DL1 misses per 230M cycles (thousands)]
A good sampling period gives low noise
17 Volatility: A noise metric for time-sequence graphs
[Flow: raw data stream → aggregate for some sampling period → miss graph → point volatilities → sort, extract 90th percentile → volatility value → volatility graph]
Point volatility = |X_t - X_{t-1}| / max(X_t, X_{t-1})
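A small C sketch of that computation, assuming the per-interval miss counts have already been collected; the function names are illustrative, not the paper's.

    #include <math.h>
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* misses[i] = DL1 misses in sampling interval i (n >= 2).
     * Returns the 90th-percentile point volatility.                   */
    double volatility(const double *misses, int n)
    {
        double *pv = malloc((size_t)(n - 1) * sizeof *pv);
        for (int t = 1; t < n; t++) {
            double hi = fmax(misses[t], misses[t - 1]);
            /* point volatility = |X_t - X_{t-1}| / max(X_t, X_{t-1})  */
            pv[t - 1] = (hi > 0.0)
                      ? fabs(misses[t] - misses[t - 1]) / hi
                      : 0.0;
        }
        qsort(pv, (size_t)(n - 1), sizeof *pv, cmp_double);
        double v = pv[(int)(0.9 * (n - 1))];  /* sort, take 90th pct   */
        free(pv);
        return v;
    }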
18 Volatility depends on sampling period
[Flow: raw data stream → aggregate at a given sampling period → point volatilities → volatility value → volatility graph]
19 Volatility profile: Volatility vs sampling period
[Plot: volatility vs sampling period (millions of samples) for 164.gzip]
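The volatility profile can be read as a sweep over candidate sampling periods. The sketch below re-aggregates a fine-grained miss stream at each period and reuses the volatility() helper from the earlier sketch; all names remain illustrative.

    #include <stdio.h>
    #include <stdlib.h>

    double volatility(const double *misses, int n);   /* earlier sketch */

    /* raw[i] = misses in the i-th fine-grained interval; periods[] are
     * candidate sampling periods expressed in fine-grained intervals.  */
    void volatility_profile(const double *raw, int n_raw,
                            const int *periods, int n_periods)
    {
        for (int p = 0; p < n_periods; p++) {
            int period = periods[p];
            int n = n_raw / period;              /* coarse intervals    */
            double *agg = calloc((size_t)n, sizeof *agg);
            for (int i = 0; i < n; i++)
                for (int j = 0; j < period; j++)
                    agg[i] += raw[i * period + j];
            printf("period=%d volatility=%.3f\n",
                   period, volatility(agg, n));
            free(agg);
        }
    }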
20 Outline
Motivation
Data structure decomposition
Phase analysis: selecting sampling period
Results:
–Aggregate
–Phase
21 Methodology
A. Source translator: C-Breeze
B. Compiler: Alpha GEM cc
C. Simulator: sim-alpha
–Validated model of the 21264 pipeline
Simulated machine: Alpha 21264
–4-way issue, 64KB 3-cycle DL1
Benchmarks: 12 C applications from the SPEC CPU2000 suite
22 Major data structures by DL1 misses
[Chart: % of DL1 misses contributed by each major data structure, per benchmark]
23 Most misses ≣ most pipeline stalls?
Process (sketch below):
–Detect stall cycles, when no instructions are committed
–Assign blame to the data structure of the oldest instruction in the pipeline
Results:
–Stall-cycle ranks track miss-count ranks
–Exceptions: tds in 179.art, search in 186.crafty
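A sketch of how that blame assignment might be wired into a simulator's per-cycle accounting. The structures, the lookup function, and the stall_cycles array are hypothetical; sim-alpha's real interfaces differ.

    /* Called once per simulated cycle. */
    typedef struct {
        unsigned long long addr;  /* effective address of a memory op    */
        int is_mem;               /* nonzero if the insn accesses memory */
    } inflight_insn;

    /* Maps an address to a data-structure id via DTrack's range table.  */
    int lookup_data_structure(unsigned long long addr);

    extern unsigned long long stall_cycles[]; /* indexed by structure id */

    void account_cycle(int insns_committed_this_cycle,
                       const inflight_insn *oldest) /* oldest in pipeline */
    {
        if (insns_committed_this_cycle > 0)
            return;                          /* not a stall cycle        */
        if (oldest != NULL && oldest->is_mem) {
            int ds = lookup_data_structure(oldest->addr);
            stall_cycles[ds]++;              /* blame the oldest insn's
                                                data structure           */
        }
    }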
24 Types of phase behavior
[Plots of DL1 misses (millions) over 115 billion cycles: I. mcf, II. art]
25 Types of phase behavior (continued)
[Plot of DL1 misses (millions) over time: III. mesa]
26 Summary
More detailed metrics → richer application comparison
Low-overhead data structure decomposition
Determining the ideal sampling period
–A volatility metric inspired by spectral analysis
Ideal sampling period is application-specific
Data structures in an application share common phase boundaries