Just-In-Time Java Compilation for the Itanium Processor
Tatiana Shpeisman, Guei-Yuan Lueh, Ali-Reza Adl-Tabatabai
Intel Labs
www.intel.com/labs

2 Introduction
• Itanium processor is a statically scheduled machine
  • Aggressive compiler techniques to extract ILP
• Just-In-Time (JIT) compiler must be fast
  • Must consider time and space efficiency of optimizations
• Balance compilation time with code quality
  • Light-weight compilation techniques
  • Use heuristics for modeling the microarchitecture
  • Leverage semantics and metadata of the JVM

3 Outline
• Introduction
• Compiler overview
• Register allocation
• Code scheduling
• Other optimizations
• Conclusions

4 Compiler Structure
[Diagram: compiler phases grouped into Front-end and Back-end. Phases named: Prepass, Inlining, Global optimizations, IR construction, Code Selection, Register Allocation, Code Emission, GC Support, Code Scheduling, Predication.]

5 Register Allocation
• Compilation time vs. code quality tradeoff
• IPF architecture has large register files
  • 128 integer, 128 floating-point, 64 predicate, 8 branch
  • Register Stack Engine (RSE) provides 96 stack registers to each procedure
• Use linear scan register allocation
  • “Linear Scan Register Allocation” by Massimiliano Poletto and Vivek Sarkar
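The linear scan approach named on this slide can be sketched roughly as follows. This is a generic illustration in the spirit of Poletto and Sarkar's description, not the JIT's actual allocator; the function name, the (start, end) interval representation, and the register numbering are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def linear_scan(intervals: Dict[str, Tuple[int, int]], num_regs: int):
    """Assign registers to live intervals (hypothetical helper).

    intervals maps a variable name to its (start, end) live interval.
    Returns (assignment, spilled).
    """
    free = list(range(num_regs))        # free physical registers
    active: List[Tuple[int, str]] = []  # (end, var) for intervals holding a register
    assignment: Dict[str, int] = {}
    spilled: List[str] = []

    # Visit intervals in order of increasing start point.
    for var, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Expire intervals that ended before this one starts.
        for item in [a for a in active if a[0] < start]:
            active.remove(item)
            free.append(assignment[item[1]])
        if free:
            assignment[var] = free.pop(0)
            active.append((end, var))
        else:
            # No register is free: spill whichever interval ends last.
            last_end, last_var = max(active)
            if last_end > end:
                assignment[var] = assignment.pop(last_var)
                active.remove((last_end, last_var))
                active.append((end, var))
                spilled.append(last_var)
            else:
                spilled.append(var)
    return assignment, spilled


assignment, spilled = linear_scan(
    {"a": (0, 10), "b": (1, 3), "c": (4, 6), "d": (2, 12)}, num_regs=2
)
```

With two registers, "b" and "c" can share a register because their intervals do not overlap, while "d" is spilled. A single pass over sorted intervals is what keeps compilation time low compared with graph coloring.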

6 Live Range vs. Live Interval
[Figure: a control-flow graph with blocks B1, B2, B3, B4 containing t1 = ..., v = t1, t2 = ..., v = t2, and ... = v, shown twice: once annotated with live ranges and once, with the blocks laid out as B1 B2 B4 B3, annotated with live intervals.]

7 Coalescing Algorithm
• Coalesce v and t in v = t iff
  • Live interval of t ends at v = t
  • Live interval of t does not intersect with live range of v
• Requires one additional reverse pass over IR
• O(N_INST + N_VAR * N_BB)
[Figure: the control-flow graph example (blocks B1, B2, B4, B3) with t1 = ..., v = t1, t2 = ..., v = t2, and ... = v.]
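The two coalescing conditions can be written as a small predicate. This is only a sketch under assumed representations: the live interval of t is a closed range (first, last) of instruction indices, the live range of v is a set of instruction indices at which v is live on entry, and the name can_coalesce is hypothetical.

```python
from typing import Set, Tuple


def can_coalesce(copy_pos: int, t_interval: Tuple[int, int], v_live_range: Set[int]) -> bool:
    """Decide whether v and t may share a register for the copy `v = t` at copy_pos."""
    first, last = t_interval
    # Condition 1: the live interval of t ends at the copy itself.
    if last != copy_pos:
        return False
    # Condition 2: the interval of t covers no point where v is live.
    return all(not (first <= p <= last) for p in v_live_range)


# Hypothetical layout: 2: t2 = ..., 3: v = t2, 4: ... = v
precise = can_coalesce(3, (3, 3), {4})           # v's live range: only index 4
conservative = can_coalesce(3, (3, 3), {2, 3, 4})  # v approximated by a whole interval
```

In this made-up layout the precise live range of v permits coalescing, while approximating v by one contiguous interval would block it, which is presumably why the condition is stated against the live range of v.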

8 Coalescing Speedup

9 Code Scheduling
• Forward cycle-based list scheduling
• Scheduling unit is extended basic block
  • Middle exits are due to run-time exceptions

    (p6,p7) = cmp.eq r35, 0
    (p6) br ThrowNullPointerException
    r10 = r35 + 16
    r11 = ld8 [r10]
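A forward, cycle-based list scheduler can be sketched as below. It is a simplified illustration rather than the scheduler described in the talk: priority is plain program order, and the issue width, the unit latencies, and the dependence edges in the usage example are assumptions.

```python
from typing import Dict, List, Set


def list_schedule(instrs: List[str], deps: Dict[str, Set[str]],
                  latency: Dict[str, int], width: int) -> Dict[str, int]:
    """Return the issue cycle of each instruction.

    deps maps an instruction to the instructions it must wait for;
    the dependence graph is assumed to be acyclic.
    """
    issue: Dict[str, int] = {}
    remaining = list(instrs)
    cycle = 0
    while remaining:
        issued = 0
        for ins in list(remaining):
            if issued == width:
                break  # this cycle is full
            # Ready when every predecessor has issued and its result is available.
            ready = all(p in issue and issue[p] + latency[p] <= cycle
                        for p in deps.get(ins, ()))
            if ready:
                issue[ins] = cycle
                remaining.remove(ins)
                issued += 1
        cycle += 1  # advance to the next cycle
    return issue


# The null-check sequence from the slide, with assumed edges and latencies.
schedule = list_schedule(
    ["cmp", "br", "add", "ld8"],
    deps={"br": {"cmp"}, "ld8": {"add", "br"}},
    latency={"cmp": 1, "br": 1, "add": 1, "ld8": 1},
    width=2,
)
```

Under these assumptions the address add issues alongside the compare in cycle 0, while the load waits for both the add and the exception branch.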

10 Type-Based Memory Disambiguation
• Use JVM metadata to disambiguate memory locations
• Type
  • Integer, floating-point, object reference …
• Kind
  • Object field, array element, virtual table address …
• Field id
  • putfield #10 vs. putfield #15
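The three criteria can be combined into a conservative may-alias test. The sketch below is an assumption about how such a check might look; the MemRef record, its string tags, and may_alias are hypothetical names, not the JIT's data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemRef:
    type: str                       # e.g. "int", "float", "ref"
    kind: str                       # e.g. "field", "array_element", "vtable"
    field_id: Optional[int] = None  # constant-pool index for field accesses


def may_alias(a: MemRef, b: MemRef) -> bool:
    """Return False only when the metadata proves the locations are distinct."""
    if a.type != b.type:
        return False
    if a.kind != b.kind:
        return False
    if (a.kind == "field" and a.field_id is not None
            and b.field_id is not None and a.field_id != b.field_id):
        return False
    return True  # nothing rules it out, so assume the references may overlap


put10 = MemRef("int", "field", 10)
put15 = MemRef("int", "field", 15)
elem = MemRef("int", "array_element")
```

Here putfield #10 and putfield #15 are proven independent, so the scheduler is free to reorder them; two accesses to the same field id remain ordered.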

11 Type-Based Disambiguation

12 Exception Dependencies
• Java exceptions are precise
• Naive approach
  • Exception checks end basic blocks
• Our approach
  • Instruction depends on exception check iff
    • Its destination is live at the exception handler, or
    • It is an exception check for a different exception type, or
    • It is a memory reference that may be guarded by the check
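The three-part rule translates directly into a predicate. The sketch below assumes a simplified instruction record; Instr, its fields, the guards set (base registers whose dereference the check protects), and depends_on_check are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set


@dataclass(frozen=True)
class Instr:
    dest: Optional[str] = None            # destination register, if any
    exception_type: Optional[str] = None  # set when the instruction is an exception check
    mem_base: Optional[str] = None        # base register when it is a memory reference
    guards: FrozenSet[str] = frozenset()  # for a check: base registers it protects


def depends_on_check(ins: Instr, check: Instr, live_at_handler: Set[str]) -> bool:
    """True if ins must stay ordered after the exception check."""
    if ins.dest is not None and ins.dest in live_at_handler:
        return True
    if ins.exception_type is not None and ins.exception_type != check.exception_type:
        return True
    if ins.mem_base is not None and ins.mem_base in check.guards:
        return True
    return False


# Assumed setup: a null check on r16 guards loads through r16 and r17 = r16 + 8.
null_check = Instr(exception_type="NullPointerException",
                   guards=frozenset({"r16", "r17"}))
add = Instr(dest="r17")
load_field = Instr(dest="r18", mem_base="r17")
load_static = Instr(dest="f8", mem_base="r21")
```

Assuming no destination is live at the handler, only the field load depends on the null check; the address add and the static load may be scheduled above it.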

13 Exception Dependency Example

    1: (p6, p0) = cmp.eq r16, 0
    2: (p6) br ThrowNullPointerException
    3: r17 = add r16, 8
    4: r18 = ld [r17]                   // load field
    5: r21 = movl 0x000F14E32019000
    6: f8 = fld [r21]                   // load static

14 Exception Dependencies

15 IPF Architecture
• Execution (functional) unit types: M, I, F, B
• Instruction (syllable) types: M, A, I, F, B, IL
• Bundles, templates
    .mii  .mi;;i  .mil  .mmi  .m;;mi  .mfi  .mmf  .mib  .mbb  .bbb  .mmb  .mfb
• Instruction group: no RAW or WAW register dependencies, with some exceptions

    .mi;;i
    r10 = ld [r15]
    r9 = add r8, 1 ;;     // stop bit
    r16 = shr r9, r32

16 Template Selection
• Pack instructions into bundles
• Choose slot for each instruction
• Insert NOP instructions
• Assign instructions to functional units
Problems:
• Resource oversubscription
• Inaccurate bypass latencies

17 Algorithm
• Greedy slot assignment
• Sort instructions by syllable type
  • M < F < IL < I < A < B

    I1: r20 = sxt r14          (I-type)
    I2: r21 = movl ADDR        (IL-type)
    I3: f15 = fadd f10, f11    (F-type)

    Unsorted: NOP I1 NOP I2 NOP I3 NOP
    Sorted:   NOP I3 I1 NOP I2
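The sorting step can be illustrated on the three instructions from this slide. Only the ordering is shown; the greedy packing into bundle slots is not modeled, and the tuple representation and the name sort_for_slots are assumptions.

```python
from typing import List, Tuple

# Priority order from the slide: M < F < IL < I < A < B
SYLLABLE_ORDER = {"M": 0, "F": 1, "IL": 2, "I": 3, "A": 4, "B": 5}


def sort_for_slots(instrs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Order (name, syllable_type) pairs for greedy slot assignment.

    sorted() is stable, so instructions of equal type keep their relative order.
    """
    return sorted(instrs, key=lambda ins: SYLLABLE_ORDER[ins[1]])


ordered = sort_for_slots([("I1", "I"), ("I2", "IL"), ("I3", "F")])
names = [name for name, _ in ordered]
```

The F-type instruction is considered first and the I-type instruction last, so fewer slots need to be padded with NOPs than in the unsorted layout shown on the slide.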

18 Template Selection Heuristics

19 Bypass Latency Accuracy
[Figure: r17 = add r16, 8 feeding r18 = ld [r17] (M-Unit), shown with the add on an M-Unit and on an I-Unit; the edges carry latencies 1 and 2.]
Phase ordering of functional unit assignment
• Code selection time is too early: underutilizes resources
• Template selection time is too late: inaccurate scheduling latencies
Solution: assign to functional unit during scheduling
• Assign to M-Unit if available, else
• Assign to I-Unit and increment latency
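The proposed solution amounts to a small decision made for each ALU instruction while the scheduler fills a cycle. The sketch below is an assumption about its shape; assign_alu_unit, the free_units counter dictionary, and the one-cycle penalty are illustrative, not taken from the implementation.

```python
from typing import Dict, Optional, Tuple


def assign_alu_unit(free_units: Dict[str, int],
                    base_latency: int) -> Tuple[Optional[str], Optional[int]]:
    """Pick a functional unit for an instruction that can run on M or I.

    free_units counts the units still available in the current cycle and is
    updated in place. Returns (unit, latency), or (None, None) if both are busy.
    """
    if free_units.get("M", 0) > 0:
        free_units["M"] -= 1
        return "M", base_latency
    if free_units.get("I", 0) > 0:
        free_units["I"] -= 1
        # Falling back to an I-unit costs an extra cycle in the scheduler's model.
        return "I", base_latency + 1
    return None, None  # caller retries in the next cycle


free = {"M": 1, "I": 1}
first = assign_alu_unit(free, 1)
second = assign_alu_unit(free, 1)
third = assign_alu_unit(free, 1)
```

Because the unit is chosen while scheduling, the latency seen by dependent instructions already reflects the choice, instead of being corrected after template selection.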

20 Modeling of Address Computation Latency

21 Other Optimizations
• Predication
  • Profitability depends on the benchmark
  • Performance variations within 2%
• Branch hints
  • Up to 50% speedup from using branch hints
• Sign-extension elimination
  • 1% potential gain for our compiler

22 Conclusions
• Light-weight optimization techniques for Itanium
• Considering microarchitecture is important
  • Cannot ignore bypass latencies
  • Template selection should be resource sensitive
• Language semantics help to improve ILP
  • Type-based memory disambiguation
  • Exception dependency elimination
