Slide 1: Using Platform-Specific Performance Counters for Dynamic Compilation
Florian Schneider and Thomas Gross, ETH Zurich (Oct 2005)

Slide 2: Introduction & Motivation
Dynamic compilers: common execution platform for OO languages (Java, C#)
Properties of OO programs difficult to analyze at compile-time
JIT compiler can immediately use information obtained at run-time

Slide 3: Introduction & Motivation
Types of information:
1. Profiles: e.g. execution frequency of methods / basic blocks
2. Hardware-specific properties: cache misses, TLB misses, branch prediction failures

Slide 4: Outline
1. Introduction
2. Requirements
3. Related work
4. Implementation
5. Results
6. Conclusions

Slide 5: Requirements
Infrastructure flexible enough to measure different execution metrics
  – Hide machine-specific details from VM
  – Keep changes to the VM/compiler minimal
Runtime overhead of collecting information from the CPU must be low
Information must be precise to be useful for online optimization

Slide 6: Related work
Profile guided optimization
  – Code positioning [Pettis PLDI90]
Hardware performance monitors
  – Relating HPM data to basic blocks [Ammons PLDI97]
  – “Vertical profiling” [Hauswirth OOPSLA 2004]
Dynamic optimization
  – Mississippi delta [Adl-Tabatabai PLDI2004]
  – Object reordering [Huang OOPSLA 2004]
Our work:
  – No instrumentation
  – Use profile data + hardware info
  – Targets fully automatic dynamic optimization

Slide 7: Hardware performance monitors
Sampling-based counting
  – CPU reports state every n events
  – Precision platform-dependent (pipelines, out-of-order execution)
Sampling provides method, basic block, or instruction-level information
  – Newer CPUs support precise sampling (e.g. P4, Itanium)

Slide 8: Hardware performance monitors
Way to localize performance bottlenecks
  – Sampling interval determines how fine-grained the information is
Smaller sampling interval → more data
  – Trade-off: precision vs. runtime overhead
  – Need enough samples for a representative picture of the program behavior

Slide 9: Implementation
Main parts
1. Kernel module: low level access to hardware, per process counting
2. User-space library: hides kernel & device driver details from VM
3. Java VM thread: collects samples periodically, maps samples to Java code
  – Implemented on top of Jikes RVM

Slide 10: System overview
[Figure: system overview]

Slide 11: Implementation
Supported events:
  – L1 and L2 cache misses
  – DTLB misses
  – Branch prediction
Parameters of the monitoring module:
  – Buffer size (fixed)
  – Polling interval (fixed)
  – Sampling interval (adaptive)
Keep runtime overhead constant by automatically changing the sampling interval at run-time
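The slides do not say how the sampling interval is adapted. As a minimal sketch, assuming a simple proportional rule, the monitor could estimate the overhead of the last polling period and rescale the interval toward a target. The function name, the 2% target, and the interval bounds are hypothetical; the default per-sample cost of 3000 cycles is the figure reported later in the talk.

```python
def adapt_interval(interval, samples_seen, cycles_elapsed,
                   cost_per_sample=3000, target_overhead=0.02,
                   min_interval=1_000, max_interval=1_000_000):
    """Return a new sampling interval (events per sample).

    Assumed policy, not taken from the talk: the number of samples is
    roughly events / interval, so overhead scales with 1 / interval and
    the interval can be rescaled by measured / target overhead.
    """
    if cycles_elapsed <= 0 or samples_seen == 0:
        return interval  # nothing measured in this period, keep the setting
    overhead = samples_seen * cost_per_sample / cycles_elapsed
    new_interval = round(interval * overhead / target_overhead)
    # Clamp so a single noisy period cannot push the interval to an extreme.
    return max(min_interval, min(max_interval, new_interval))
```

For example, 40,000 samples in 3 billion cycles at 3000 cycles each is about 4% overhead, so an interval of 10000 would be doubled to 20000 under this rule.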

Slide 12: From raw data to Java
Determine method + bytecode instr
  – Build sorted method table
  – Map offset to bytecode

    0x080485e1: mov 0x4(%esi),%esi
    0x080485e4: mov $0x4,%edi
    0x080485e9: mov (%esi,%edi,4),%esi
    0x080485ec: mov %ebx,0x4(%esi)
    0x080485ef: mov $0x4,%ebx
    0x080485f4: push %ebx
    0x080485f5: mov $0x0,%ebx
    0x080485fa: push %ebx
    0x080485fb: mov 0x8(%ebp),%ebx
    0x080485fe: push %ebx
    0x080485ff: mov (%ebx),%ebx
    0x08048601: call *0x4(%ebx)
    0x08048604: add $0xc,%esp
    0x08048607: mov 0x8(%ebp),%ebx
    0x0804860a: mov 0x4(%ebx),%ebx

Bytecode labels in the figure: GETFIELD, ARRAYLOAD, INVOKEVIRTUAL
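The offset-to-bytecode step can be sketched as a lookup in a per-method table sorted by machine-code offset. The table below is hypothetical: the transcript does not preserve which instructions belong to which bytecode, so the boundaries are invented for illustration and only the start address comes from the listing.

```python
import bisect

# Start address of the compiled method (first address in the listing).
METHOD_START = 0x080485e1

# Hypothetical map: (machine-code offset where the bytecode's code begins,
# bytecode index, opcode). Boundaries are assumed, not from the slides.
BYTECODE_MAP = [
    (0x00, 0, "GETFIELD"),
    (0x03, 1, "ARRAYLOAD"),
    (0x0e, 2, "INVOKEVIRTUAL"),
]
OFFSETS = [entry[0] for entry in BYTECODE_MAP]


def bytecode_at(pc):
    """Map a sampled PC to (bytecode index, opcode), or None if below the method."""
    offset = pc - METHOD_START
    if offset < 0:
        return None
    # Rightmost entry whose start offset is <= the sampled offset.
    i = bisect.bisect_right(OFFSETS, offset) - 1
    _, index, opcode = BYTECODE_MAP[i]
    return index, opcode
```

A real table would also record where the method ends, so that a PC past the last instruction is not attributed to the final bytecode.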

Slide 13: From raw data to Java
Sample gives PC + register contents
PC → machine code → compiled Java code → bytecode instruction
For data address: use registers + machine code to calculate target address:
  – GETFIELD → indirect load
    mov 12(eax), eax   // 12 = offset of field
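The data-address calculation follows the usual x86 form base + index * scale + displacement. The sketch below assumes the instruction's operands have already been decoded and that the sampled register values are those in effect before the load executes; the function name and the register dictionary are illustrative, not part of the described system.

```python
def effective_address(regs, base=None, index=None, scale=1, disp=0):
    """Compute a 32-bit x86 effective address from sampled register values.

    regs maps register names to values taken from the sample. The decoded
    operand fields (base, index, scale, disp) are assumed to come from a
    separate instruction decoder that is not shown here.
    """
    addr = disp
    if base is not None:
        addr += regs[base]
    if index is not None:
        addr += regs[index] * scale
    return addr & 0xFFFFFFFF  # wrap to 32 bits, as on IA-32


# GETFIELD compiled as `mov 12(eax), eax`: object address in eax, field offset 12.
sample = {"eax": 0x40001000, "esi": 0x1000, "edi": 4}
field_address = effective_address(sample, base="eax", disp=12)
```

The same helper covers the array load `mov (%esi,%edi,4),%esi` from the listing by passing `base="esi", index="edi", scale=4`.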

Slide 14: Engineering issues
Lookup of PC to get method / BC instr must be efficient
  – Done in parallel with user program
  – Use binary search / hash table
  – Update at recompilation, GC
Identify 100% of instructions (PCs):
  – Include samples from application, VM, and library code
  – Dealing with native parts
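A minimal sketch of the binary-search variant of the PC lookup, including the update needed when a method is recompiled or its code is moved. The class, its method names, and the addresses in the usage are hypothetical; the actual Jikes RVM data structures are not described in the slides.

```python
import bisect


class MethodTable:
    """Sorted table of compiled-code ranges, searched by binary search."""

    def __init__(self):
        self._starts = []   # sorted start addresses
        self._ranges = []   # parallel list of (start, end, method name)

    def add(self, start, end, name):
        """Register compiled code occupying the half-open range [start, end)."""
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._ranges.insert(i, (start, end, name))

    def remove(self, start):
        """Drop a stale entry, e.g. after recompilation or when GC moves code."""
        i = bisect.bisect_left(self._starts, start)
        if i < len(self._starts) and self._starts[i] == start:
            del self._starts[i]
            del self._ranges[i]

    def lookup(self, pc):
        """Return (method name, offset into method) or None for unknown PCs."""
        i = bisect.bisect_right(self._starts, pc) - 1
        if i < 0:
            return None
        start, end, name = self._ranges[i]
        return (name, pc - start) if pc < end else None
```

PCs that fall outside every registered range (returned as `None` here) would correspond to native code, which needs separate handling.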

Slide 15: Infrastructure
Jikes RVM 2.3.5 on Linux 2.4 kernel as runtime platform
Pentium 4, 3 GHz, 1 GB RAM, 1 MB L2 cache
Measured data show:
  – Runtime overhead
  – Extraction of meaningful information

Slide 16: Runtime overhead

Program     Orig [sec], [score]   Sampling interval 10000   Sampling interval 1000
javac       7.18                  2.0%                      2.4%
raytrace    4.04                  2.4%                      2.0%
jess        2.93                  0.6%                      0.1%
jack        2.73                  3.5%                      2.7%
db          10.49                 0.1%                      3.1%
compress    6.50                  0.9%                      1.5%
mpegaudio   6.54                  1.3%                      0.3%
jbb         6209.67               2.4%                      4.6%
average                           1.6%                      2.1%

Experiment setup: monitor L2 cache misses

Slide 17: Runtime overhead: specJBB
Total cost / sample: ~3000 cycles

Slide 18: Measurements
Measure which instructions produce most events (cache misses, branch mispredictions)
  – Potential for data locality and control flow optimizations
Compare different SPEC benchmarks
  – Find “hot spots”: instructions that produce 80% of all measured events
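One way to read the hot-spot metric, sketched under the assumption that each sample is attributed to a single instruction: sort instructions by event count and take the smallest prefix that accounts for 80% of all samples. The function name and the toy data are illustrative.

```python
from collections import Counter


def hot_spots(samples, coverage=0.8):
    """Return the fewest instructions that together cover `coverage` of all samples.

    `samples` is an iterable of instruction identifiers (e.g. PCs or
    bytecode positions), one per sampled event.
    """
    counts = Counter(samples)
    total = sum(counts.values())
    hot, covered = [], 0
    for instruction, n in counts.most_common():
        if covered >= coverage * total:
            break
        hot.append(instruction)
        covered += n
    return hot


# Toy data: four instructions, 100 sampled events.
toy = ["a"] * 70 + ["b"] * 15 + ["c"] * 10 + ["d"] * 5
hot = hot_spots(toy)  # "a" and "b" together cover 85 of 100 events
```

The size of the returned list corresponds to the "80% quantile" figure, and the number of distinct instructions seen corresponds to N.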

Slide 19: L1/L2 Cache misses
80% quantile = 21 instructions (N=571)
80% quantile = 13 (N=295)

Slide 20: L1/L2 Cache misses
80% quantile = 76 (N=2361)
80% quantile = 477 (N=8526)

Slide 21: L1/L2 Cache misses
80% quantile = 1296 (N=3172)
80% quantile = 153 (N=672)

Slide 22: Branch prediction
80% quantile = 307 (N=4193)
80% quantile = 1575 (N=7478)

Slide 23: Summary

80%-quantile in % of total   L1 misses   L2 misses   Branch pred.
specJBB                      5.6%        3.2%        7.3%
javac                        40.9%       22.7%       21.1%
db                           3.7%        4.4%        0.8%

The distribution of events over the program differs significantly between benchmarks
Challenge: Are the data precise enough to guide optimizations in a dynamic compiler?

Slide 24: Further work
Apply information in optimizer
  – Data: access path expressions p.x.y
  – Control flow: inlining, I-cache locality
Investigate flexible sampling interval
Further optimizations of monitoring system
  – Replacing expensive JNI calls
  – Avoid copying of samples

Slide 25: Concluding remarks
Precise performance event monitoring is possible with low overhead (~2%)
Monitoring infrastructure tied into Jikes RVM compiler
Instruction level information allows optimizations to focus on “hot spots”
Good platform to study coupling compiler decisions to hardware-specific platform properties

Slide 26: Backup: P4 performance counters
P4 stores samples in OS supplied buffer
  – Interrupt only generated when buffer is filled
  – Lower runtime overhead
All registers are stored together with IP
  – Possible to obtain data address profiles
Only subset of events available for PEBS
  – Future architectures may support more
Registers shown in the figure: EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP, EIP
