Published by Shawn Charles. Modified over 9 years ago.

Using Platform-Specific Performance Counters for Dynamic Compilation
Florian Schneider and Thomas Gross, ETH Zurich
Oct 2005

Introduction & Motivation
Dynamic compilers are a common execution platform for OO languages (Java, C#)
Properties of OO programs are difficult to analyze at compile time
A JIT compiler can immediately use information obtained at run time

Introduction & Motivation
Types of information:
1. Profiles: e.g. execution frequency of methods / basic blocks
2. Hardware-specific properties: cache misses, TLB misses, branch prediction failures

Outline
1. Introduction
2. Requirements
3. Related work
4. Implementation
5. Results
6. Conclusions

Requirements
Infrastructure flexible enough to measure different execution metrics
  – Hide machine-specific details from VM
  – Keep changes to the VM/compiler minimal
Runtime overhead of collecting information from the CPU must be low
Information must be precise to be useful for online optimization

Related work
Profile-guided optimization
  – Code positioning [Pettis PLDI90]
Hardware performance monitors
  – Relating HPM data to basic blocks [Ammons PLDI97]
  – “Vertical profiling” [Hauswirth OOPSLA 2004]
Dynamic optimization
  – Mississippi delta [Adl-Tabatabai PLDI2004]
  – Object reordering [Huang OOPSLA 2004]
Our work:
  – No instrumentation
  – Use profile data + hardware info
  – Targets fully automatic dynamic optimization

Hardware performance monitors
Sampling-based counting
  – CPU reports state every n events
  – Precision is platform-dependent (pipelines, out-of-order execution)
Sampling provides method, basic block, or instruction-level information
  – Newer CPUs support precise sampling (e.g. P4, Itanium)

Hardware performance monitors
Way to localize performance bottlenecks
  – Sampling interval determines how fine-grained the information is
Smaller sampling interval → more data
  – Trade-off: precision vs. runtime overhead
  – Need enough samples for a representative picture of the program behavior

Implementation
Main parts
1. Kernel module: low-level access to hardware, per-process counting
2. User-space library: hides kernel & device driver details from VM
3. Java VM thread: collects samples periodically, maps samples to Java code
  – Implemented on top of Jikes RVM

System overview (figure)

Implementation
Supported events:
  – L1 and L2 cache misses
  – DTLB misses
  – Branch prediction
Parameters of the monitoring module:
  – Buffer size (fixed)
  – Polling interval (fixed)
  – Sampling interval (adaptive)
Keep runtime overhead constant by changing the sampling interval automatically at run time

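The adaptive sampling interval can be pictured as a small feedback loop. The sketch below is an illustration only, not the authors' implementation: the class name, the proportional update rule, and the clamping bounds are assumptions, and the cost per sample is a parameter (the talk reports roughly 3000 cycles per sample for specJBB). Since the number of samples is roughly inversely proportional to the interval, scaling the interval by observed overhead divided by target overhead moves the overhead toward the target.

```java
/** Hypothetical controller; names and update rule are assumptions, not the authors' code. */
final class SamplingIntervalController {
    private final double targetOverhead;  // desired fraction of cycles spent on monitoring, e.g. 0.02
    private final long cyclesPerSample;   // assumed cost of handling one sample
    private final long minInterval;       // lower clamp for the interval
    private final long maxInterval;       // upper clamp for the interval
    private long interval;                // number of events between two samples

    SamplingIntervalController(double targetOverhead, long cyclesPerSample,
                               long initialInterval, long minInterval, long maxInterval) {
        this.targetOverhead = targetOverhead;
        this.cyclesPerSample = cyclesPerSample;
        this.interval = initialInterval;
        this.minInterval = minInterval;
        this.maxInterval = maxInterval;
    }

    long interval() {
        return interval;
    }

    /** Called once per polling period with the samples collected and the cycles that elapsed. */
    long update(long samplesSeen, long cyclesElapsed) {
        if (cyclesElapsed <= 0) {
            return interval;  // nothing measured, keep the current setting
        }
        double overhead = (double) samplesSeen * cyclesPerSample / cyclesElapsed;
        // Overhead is roughly proportional to 1/interval, so rescale proportionally.
        double scaled = interval * (overhead / targetOverhead);
        interval = Math.max(minInterval, Math.min(maxInterval, Math.round(scaled)));
        return interval;
    }
}
```

With a 2% target, a period that spent 4% of its cycles on sampling doubles the interval; a period with no samples drops the interval to its lower bound.
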
From raw data to Java
Determine method + bytecode instr
  – Build sorted method table
  – Map offset to bytecode
Machine code listing (annotated on the slide with the bytecodes GETFIELD, ARRAYLOAD, INVOKEVIRTUAL):
    0x080485e1: mov 0x4(%esi),%esi
    0x080485e4: mov $0x4,%edi
    0x080485e9: mov (%esi,%edi,4),%esi
    0x080485ec: mov %ebx,0x4(%esi)
    0x080485ef: mov $0x4,%ebx
    0x080485f4: push %ebx
    0x080485f5: mov $0x0,%ebx
    0x080485fa: push %ebx
    0x080485fb: mov 0x8(%ebp),%ebx
    0x080485fe: push %ebx
    0x080485ff: mov (%ebx),%ebx
    0x08048601: call *0x4(%ebx)
    0x08048604: add $0xc,%esp
    0x08048607: mov 0x8(%ebp),%ebx
    0x0804860a: mov 0x4(%ebx),%ebx

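The two lookup steps (PC to method, then machine-code offset to bytecode) can be sketched with two binary searches. This is a minimal illustration, not the Jikes RVM data structures: `CompiledMethod`, `MethodTable`, and the offset-to-bytecode arrays are hypothetical names, and the mapping a real compiler records may look different.

```java
import java.util.Arrays;

/** Hypothetical record for one compiled method; the fields are assumptions for illustration. */
final class CompiledMethod {
    final String name;
    final long start;           // first machine-code address of the method
    final long end;             // one past the last machine-code byte
    final int[] mcOffsets;      // sorted offsets at which a bytecode's machine code begins
    final int[] bytecodeIndex;  // bytecodeIndex[i] belongs to mcOffsets[i]

    CompiledMethod(String name, long start, long end, int[] mcOffsets, int[] bytecodeIndex) {
        this.name = name;
        this.start = start;
        this.end = end;
        this.mcOffsets = mcOffsets;
        this.bytecodeIndex = bytecodeIndex;
    }

    /** Bytecode index whose machine code covers pc, or -1 if pc lies before the first entry. */
    int bytecodeAt(long pc) {
        int offset = (int) (pc - start);
        int i = Arrays.binarySearch(mcOffsets, offset);
        if (i < 0) {
            i = -i - 2;  // index of the last entry with an offset <= the sampled offset
        }
        return i >= 0 ? bytecodeIndex[i] : -1;
    }
}

/** Sorted method table: maps a sampled PC to the method containing it. */
final class MethodTable {
    private final CompiledMethod[] methods;  // sorted by start address
    private final long[] starts;             // start addresses, same order as methods

    MethodTable(CompiledMethod... input) {
        methods = input.clone();
        Arrays.sort(methods, (a, b) -> Long.compare(a.start, b.start));
        starts = new long[methods.length];
        for (int i = 0; i < methods.length; i++) {
            starts[i] = methods[i].start;
        }
    }

    /** Returns the method whose code range contains pc, or null if none does. */
    CompiledMethod lookup(long pc) {
        int i = Arrays.binarySearch(starts, pc);
        if (i < 0) {
            i = -i - 2;  // last method starting at or before pc
        }
        if (i < 0 || pc >= methods[i].end) {
            return null;
        }
        return methods[i];
    }
}
```

A sample would be resolved by calling `lookup(pc)` first and, if a method is found, `bytecodeAt(pc)` on the result.
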
From raw data to Java
Sample gives PC + register contents
PC → machine code → compiled Java code → bytecode instruction
For data address: use registers + machine code to calculate target address:
  – GETFIELD → indirect load
    mov 12(eax), eax // 12 = offset of field

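The data address calculation follows the usual x86 form base + index * scale + displacement. The helper below is a hedged sketch with a hypothetical name (`EffectiveAddress.compute`); it assumes 32-bit addresses and that the register values in the sample are the ones the instruction used as inputs. For an instruction such as the GETFIELD example, which overwrites its own base register, that assumption needs checking on the target platform.

```java
/** Hypothetical helper; computes a 32-bit x86 effective address from sampled register values. */
final class EffectiveAddress {
    /**
     * base and index are register contents taken from the sample,
     * scale and displacement are decoded from the machine instruction.
     */
    static long compute(long base, long index, int scale, int displacement) {
        // Mask to 32 bits so that address arithmetic wraps as it does on IA-32.
        return (base + index * scale + displacement) & 0xFFFFFFFFL;
    }
}
```

For `mov 12(eax), eax` with `eax = 0x1000`, the call `EffectiveAddress.compute(0x1000L, 0L, 1, 12)` yields `0x100C`, the address of the field being loaded.
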
Engineering issues
Lookup of PC to get method / BC instr must be efficient
  – Done in parallel with user program
  – Use binary search / hash table
  – Update at recompilation, GC
Identify 100% of instructions (PCs):
  – Include samples from application, VM, and library code
  – Dealing with native parts

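One way to keep the PC lookup usable while code is recompiled or moved is a concurrent ordered map keyed by start address. The sketch below uses `java.util.concurrent.ConcurrentSkipListMap` from the standard library; it is a swapped-in technique chosen for illustration, not what the talk describes in detail, and `LiveCodeMap` with its methods is a hypothetical name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical code map that the sampling thread reads while the VM updates it. */
final class LiveCodeMap {
    private static final class Range {
        final long end;       // one past the last machine-code byte
        final String method;  // name of the compiled method
        Range(long end, String method) {
            this.end = end;
            this.method = method;
        }
    }

    private final ConcurrentSkipListMap<Long, Range> byStart = new ConcurrentSkipListMap<>();

    /** Called when a method is compiled or recompiled. */
    void install(long start, long end, String method) {
        byStart.put(start, new Range(end, method));
    }

    /** Called when old code becomes invalid, e.g. after recompilation or when GC moves it. */
    void retire(long start) {
        byStart.remove(start);
    }

    /** Called by the sampling thread; returns null for PCs outside any known method. */
    String methodAt(long pc) {
        Map.Entry<Long, Range> e = byStart.floorEntry(pc);  // greatest start address <= pc
        if (e == null || pc >= e.getValue().end) {
            return null;
        }
        return e.getValue().method;
    }
}
```

Samples taken in code that has already been retired resolve to null here; a real system would have to decide whether to drop such samples or keep the old mapping until the buffer is drained.
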
Infrastructure
Jikes RVM 2.3.5 on Linux 2.4 kernel as runtime platform
Pentium 4, 3 GHz, 1 GB RAM, 1 MB L2 cache
Measured data show:
  – Runtime overhead
  – Extraction of meaningful information

Runtime overhead
Program     Orig [sec], [score]   Sampling interval 10000   Sampling interval 1000
javac       7.18                  2.0%                      2.4%
raytrace    4.04                  2.4%                      2.0%
jess        2.93                  0.6%                      0.1%
jack        2.73                  3.5%                      2.7%
db          10.49                 0.1%                      3.1%
compress    6.50                  0.9%                      1.5%
mpegaudio   6.54                  1.3%                      0.3%
jbb         6209.67               2.4%                      4.6%
average                           1.6%                      2.1%
Experiment setup: monitor L2 cache misses

Runtime overhead: specJBB (figure)
Total cost / sample: ~ 3000 cycles

Measurements
Measure which instructions produce most events (cache misses, branch mispredictions)
  – Potential for data locality and control flow optimizations
Compare different SPEC benchmarks
  – Find “hot spots”: instructions that produce 80% of all measured events

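The hot-spot measure can be computed from per-instruction event counts by sorting them and accumulating from the largest down. The sketch below reads the "80% quantile" as the smallest number of instructions that together account for at least 80% of all sampled events; that reading, and the name `HotSpots.quantileCount`, are assumptions made for illustration.

```java
import java.util.Arrays;

/** Hypothetical helper for the hot-spot statistic. */
final class HotSpots {
    /**
     * Returns the smallest number of instructions whose event counts sum to at least
     * the given fraction (e.g. 0.8) of all events. Each array element is the number
     * of sampled events attributed to one instruction.
     */
    static int quantileCount(long[] eventsPerInstruction, double fraction) {
        long[] sorted = eventsPerInstruction.clone();
        Arrays.sort(sorted);  // ascending; walked from the end to get largest first
        long total = 0;
        for (long c : sorted) {
            total += c;
        }
        if (total == 0) {
            return 0;  // no events were sampled
        }
        long running = 0;
        int instructions = 0;
        for (int i = sorted.length - 1; i >= 0; i--) {
            running += sorted[i];
            instructions++;
            if (running >= fraction * total) {
                break;
            }
        }
        return instructions;
    }
}
```

Dividing the result by the number of instructions that received any sample gives a percentage that can be compared across benchmarks.
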
L1/L2 Cache misses (figure)
80% quantile = 21 instructions (N=571)
80% quantile = 13 (N=295)

L1/L2 Cache misses (figure)
80% quantile = 76 (N=2361)
80% quantile = 477 (N=8526)

L1/L2 Cache misses (figure)
80% quantile = 1296 (N=3172)
80% quantile = 153 (N=672)

Branch prediction (figure)
80% quantile = 307 (N=4193)
80% quantile = 1575 (N=7478)

Summary
80%-quantile in % of total
Benchmark   L1 misses   L2 misses   Branch pred.
specJBB     5.6%        3.2%        7.3%
javac       40.9%       22.7%       21.1%
db          3.7%        4.4%        0.8%
Distribution of events over the program differs significantly between benchmarks
Challenge: Are the data precise enough to guide optimizations in a dynamic compiler?

Further work
Apply information in optimizer
  – Data: access path expressions p.x.y
  – Control flow: inlining, I-cache locality
Investigate flexible sampling interval
Further optimizations of monitoring system
  – Replacing expensive JNI calls
  – Avoid copying of samples

Concluding remarks
Precise performance event monitoring is possible with low overhead (~ 2%)
Monitoring infrastructure tied into Jikes RVM compiler
Instruction-level information allows optimizations to focus on “hot spots”
Good platform to study coupling compiler decisions to hardware-specific platform properties

Backup: P4 performance counters
P4 stores samples in OS-supplied buffer
  – Interrupt only generated when buffer is filled
  – Lower runtime overhead
All registers are stored together with IP
  – Possible to obtain data address profiles
Only subset of events available for PEBS
  – Future architectures may support more
Sample record (figure): EAX EBX ECX EDX ESI EDI EBP ESP EIP

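Reading samples out of such a buffer amounts to slicing it into fixed-size records. The sketch below assumes, purely for illustration, that each record holds nine little-endian 32-bit values in the order shown on the slide; the actual record layout is defined by the processor documentation and may differ. `SampleRecord` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical sample record; field order and size are assumptions taken from the slide. */
final class SampleRecord {
    static final String[] FIELDS = {"EAX", "EBX", "ECX", "EDX", "ESI", "EDI", "EBP", "ESP", "EIP"};
    static final int BYTES = FIELDS.length * 4;  // nine 32-bit values per record

    final long[] values = new long[FIELDS.length];  // unsigned 32-bit contents, one per field

    /** Instruction pointer of the sampled instruction (last field in the assumed layout). */
    long eip() {
        return values[FIELDS.length - 1];
    }

    /** Decodes all complete records in the first validBytes bytes of buffer. */
    static SampleRecord[] decode(byte[] buffer, int validBytes) {
        ByteBuffer in = ByteBuffer.wrap(buffer, 0, validBytes).order(ByteOrder.LITTLE_ENDIAN);
        SampleRecord[] records = new SampleRecord[validBytes / BYTES];
        for (int r = 0; r < records.length; r++) {
            SampleRecord rec = new SampleRecord();
            for (int f = 0; f < FIELDS.length; f++) {
                rec.values[f] = in.getInt() & 0xFFFFFFFFL;  // keep the value unsigned
            }
            records[r] = rec;
        }
        return records;
    }
}
```

Decoding the whole buffer in one pass fits the point made on the slide: the interrupt fires only when the buffer is full, so per-sample work can be deferred to a batch step.
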