Published by Anissa Watkins. Modified over 9 years ago.
1
www.intel.com/labs
Just-In-Time Java Compilation for the Itanium Processor
Tatiana Shpeisman, Guei-Yuan Lueh, Ali-Reza Adl-Tabatabai
Intel Labs
2
Introduction
- The Itanium processor is a statically scheduled machine
  - Aggressive compiler techniques are needed to extract ILP
- A Just-In-Time (JIT) compiler must be fast
  - Must consider time and space efficiency of optimizations
  - Balance compilation time with code quality
- Light-weight compilation techniques
  - Use heuristics for modeling the microarchitecture
  - Leverage semantics and metadata of the JVM
3
Outline
- Introduction
- Compiler overview
- Register allocation
- Code scheduling
- Other optimizations
- Conclusions
4
Compiler Structure
[Diagram: compilation pipeline]
Front-end: Prepass, Inlining, Global optimizations, IR construction
Back-end: Code Selection, Register Allocation, Code Emission, GC Support, Code Scheduling, Predication
5
Register Allocation
- Compilation time vs. code quality tradeoff
- IPF architecture has large register files
  - 128 integer, 128 floating-point, 64 predicate, 8 branch
  - Register Stack Engine (RSE) provides 96 stack registers to each procedure
- Use linear scan register allocation
  - “Linear Scan Register Allocation” by Massimiliano Poletto and Vivek Sarkar
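As a rough illustration of linear scan register allocation, the sketch below follows the general shape of the Poletto and Sarkar algorithm: walk the live intervals in order of start point, expire intervals that have ended, and spill the interval that ends last when no register is free. The function name, the `(start, end)` interval encoding, and the sample intervals are assumptions made for this example; this is not the JIT's actual allocator.

```python
def linear_scan(intervals, num_regs):
    """Allocate registers to live intervals (hypothetical encoding).

    intervals: dict mapping a variable name to an inclusive (start, end) pair.
    Returns (assignment, spilled): name -> register index, and a set of names.
    """
    free = list(range(num_regs))
    active = []            # (end, name) pairs, kept sorted by end point
    assignment = {}
    spilled = set()
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Expire intervals that ended before this one starts.
        while active and active[0][0] < start:
            _, old = active.pop(0)
            free.append(assignment[old])
        if free:
            assignment[name] = free.pop(0)
            active.append((end, name))
        elif active and active[-1][0] > end:
            # The active interval that ends last gives up its register.
            _, victim = active.pop()
            assignment[name] = assignment.pop(victim)
            spilled.add(victim)
            active.append((end, name))
        else:
            spilled.add(name)
        active.sort()
    return assignment, spilled


# Four overlapping intervals competing for two registers.
assignment, spilled = linear_scan(
    {"a": (0, 10), "b": (1, 3), "c": (2, 8), "d": (4, 6)}, num_regs=2)
print(assignment, spilled)
```

With only two registers, the long interval `a` is the one spilled, because it ends last at the point where `c` needs a register.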
6
Live Range vs. Live Interval
[Figure: a control-flow graph with blocks B1 to B4 containing t1 = ..., v = t1, t2 = ..., v = t2 and a use ... = v, drawn once as a graph (Live Ranges) and once in the linear block order B1, B2, B4, B3 (Live Intervals)]
7
Coalescing Algorithm
- Coalesce v and t in v = t iff
  - Live interval of t ends at v = t
  - Live interval of t does not intersect with live range of v
- Requires one additional reverse pass over IR
  - O(N_INST + N_VAR * N_BB)
[Figure: blocks B1, B2, B4, B3 in linear order, containing t1 = ..., v = t1, t2 = ..., v = t2 and a use ... = v]
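The two coalescing conditions can be sketched as a small predicate. The encoding is an assumption made for illustration: instructions are numbered in linear order, the live interval of t is a `(start, end)` pair of instruction indices, and the live range of v is the set of instruction indices at which v is live on entry. The numbering of the example (t2 defined at index 4, copied at index 5, v used at index 3) is hypothetical but mirrors the block order B1, B2, B4, B3.

```python
def can_coalesce(copy_pos, t_interval, v_live_points):
    """Decide whether v and t may share a register for the copy `v = t`.

    copy_pos:       linear index of the copy instruction
    t_interval:     (start, end) live interval of t, end being its last use
    v_live_points:  set of indices where v is live on entry (its live range)
    """
    t_start, t_end = t_interval
    # Condition 1: the live interval of t ends at the copy.
    if t_end != copy_pos:
        return False
    # Condition 2: t's interval does not intersect v's live range.
    return not any(t_start < p <= t_end for p in v_live_points)


# Using v's live range (with its hole), t2 and v can be coalesced.
print(can_coalesce(5, (4, 5), {3}))          # True
# Treating v as live over its whole interval would reject the same copy.
print(can_coalesce(5, (4, 5), {3, 4, 5}))    # False
```

The second call shows why the check uses the live range of v rather than its live interval: the interval covers the hole in which t2 is live and would block the coalescing.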
8
Coalescing Speedup [chart]
9
Code Scheduling
- Forward cycle-based list scheduling
- Scheduling unit is extended basic block
  - Middle exits are due to run-time exceptions

    (p6,p7) = cmp.eq r35, 0
    (p6) br ThrowNullPointerException
    r10 = r35 + 16
    r11 = ld8 [r10]
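A minimal sketch of forward cycle-based list scheduling follows. It advances one cycle at a time and issues up to `width` ready instructions, preferring those with the longest remaining latency path. The priority function, the issue width, the latencies, and the dependence edges in the example are assumptions for illustration; they are not taken from the compiler described here.

```python
def list_schedule(latency, deps, width):
    """Forward cycle-based list scheduling over a dependence DAG.

    latency: dict instr -> result latency in cycles (assumed values)
    deps:    dict instr -> set of predecessor instrs
    width:   maximum number of instructions issued per cycle
    Returns a dict instr -> issue cycle.
    """
    succs = {i: [] for i in latency}
    for i, preds in deps.items():
        for p in preds:
            succs[p].append(i)

    height = {}

    def h(i):
        # Priority: longest latency path from i to the end of the DAG.
        if i not in height:
            height[i] = latency[i] + max((h(s) for s in succs[i]), default=0)
        return height[i]

    issue = {}
    cycle = 0
    while len(issue) < len(latency):
        ready = [i for i in latency
                 if i not in issue
                 and all(p in issue and issue[p] + latency[p] <= cycle
                         for p in deps.get(i, ()))]
        ready.sort(key=h, reverse=True)
        for i in ready[:width]:
            issue[i] = cycle
        cycle += 1
    return issue


# The null check and the address add are independent; the load waits for both.
latency = {"cmp": 1, "br": 1, "add": 1, "ld": 2}
deps = {"br": {"cmp"}, "ld": {"add", "br"}}
print(list_schedule(latency, deps, width=2))
```

In this example the address computation issues in the same cycle as the compare, which is the kind of overlap an extended basic block with a middle exit allows.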
10
Type-Based Memory Disambiguation
- Use JVM metadata to disambiguate memory locations
  - Type: integer, floating-point, object reference, …
  - Kind: object field, array element, virtual table address, …
  - Field id: putfield #10 vs. putfield #15
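The three attributes (type, kind, field id) suggest a simple may-alias test: two memory references can only touch the same location if all three agree. The sketch below is one way to express that; the `MemRef` class, its string tags, and the function name are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemRef:
    """Hypothetical summary of a memory access built from JVM metadata."""
    type: str                       # e.g. "int", "float", "ref"
    kind: str                       # e.g. "field", "array_element", "vtable"
    field_id: Optional[int] = None  # constant-pool index for field accesses


def may_alias(a: MemRef, b: MemRef) -> bool:
    # Locations of different types can never overlap.
    if a.type != b.type:
        return False
    # An object field is never an array element or a vtable slot.
    if a.kind != b.kind:
        return False
    # Two field accesses with different known field ids are distinct.
    if (a.kind == "field" and a.field_id is not None
            and b.field_id is not None and a.field_id != b.field_id):
        return False
    return True


# putfield #10 vs. putfield #15: same type and kind, different fields.
print(may_alias(MemRef("int", "field", 10), MemRef("int", "field", 15)))  # False
print(may_alias(MemRef("int", "field", 10), MemRef("int", "field", 10)))  # True
```

A scheduler can skip the memory dependence edge between two references whenever `may_alias` returns `False`.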
11
Type-Based Disambiguation [chart]
12
Exception Dependencies
- Java exceptions are precise
- Naive approach
  - Exception checks end basic blocks
- Our approach
  - An instruction depends on an exception check iff
    - Its destination is live at the exception handler, or
    - It is an exception check for a different exception type, or
    - It is a memory reference that may be guarded by the check
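The three-part rule can be written as a predicate that a scheduler would consult when deciding whether an instruction may move above an exception check. The data model below (the `Instr` fields, the string check ids, and the `live_at_handler` set) is hypothetical; in particular, how "may be guarded by" is computed is assumed to come from an earlier analysis.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    """Hypothetical instruction summary used only for this sketch."""
    dest: Optional[str] = None         # destination register, if any
    is_check: bool = False             # True for an exception check
    exc_type: Optional[str] = None     # exception type raised by a check
    check_id: Optional[str] = None     # identity of a check, e.g. "null(r16)"
    is_mem_ref: bool = False
    guarded_by: frozenset = frozenset()  # ids of checks that may guard this access


def depends_on_check(instr, check, live_at_handler):
    # 1. The destination is live at the exception handler.
    if instr.dest is not None and instr.dest in live_at_handler:
        return True
    # 2. It is an exception check for a different exception type.
    if instr.is_check and instr.exc_type != check.exc_type:
        return True
    # 3. It is a memory reference that may be guarded by the check.
    if instr.is_mem_ref and check.check_id in instr.guarded_by:
        return True
    return False


null_check = Instr(is_check=True, exc_type="NullPointerException", check_id="null(r16)")
add_addr = Instr(dest="r17")
load_field = Instr(dest="r18", is_mem_ref=True, guarded_by=frozenset({"null(r16)"}))
load_static = Instr(dest="f8", is_mem_ref=True)

for name, ins in [("add", add_addr), ("load field", load_field), ("load static", load_static)]:
    print(name, depends_on_check(ins, null_check, live_at_handler=set()))
```

Under these assumptions only the field load stays below the null check; the address add and the static load are free to be scheduled above it.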
13
Exception Dependency Example

    1: (p6, p0) = cmp.eq r16, 0
    2: (p6) br ThrowNullPointerException
    3: r17 = add r16, 8
    4: r18 = ld [r17]                  // load field
    5: r21 = movl 0x000F14E32019000
    6: f8 = fld [r21]                  // load static
14
Exception Dependencies [chart]
15
IPF Architecture
- Execution (functional) unit type: M, I, F, B
- Instruction (syllable) type: M, A, I, F, B, IL
- Bundles, templates
  - .mii .mi;;i .mil .mmi .m;;mi .mfi .mmf .mib .mbb .bbb .mmb .mfb
- Instruction group: no RAW or WAW register dependencies, with some exceptions

    .mi;;i
    r10 = ld [r15]
    r9 = add r8, 1 ;;      // stop bit
    r16 = shr r9, r32
16
Template Selection
- Pack instructions into bundles
  - Choose slot for each instruction
  - Insert NOP instructions
- Assign instructions to functional units
- Problem:
  - Resource oversubscription
  - Inaccurate bypass latencies
17
Greedy Slot Assignment
- Algorithm: sort instructions by syllable type
  - M < F < IL < I < A < B
- Example

    I1: r20 = sxt r14           (I-type)
    I2: r21 = movl ADDR         (IL-type)
    I3: f15 = fadd f10, f11     (F-type)

  - Unsorted: NOP I1 NOP I2 NOP I3 NOP
  - Sorted: NOP I3 I1 NOP I2
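The effect of sorting can be reproduced with a deliberately simplified packer. The sketch assumes that all instructions belong to one instruction group with no dependencies, that only the stop-free templates exist, that an IL-type instruction occupies the L and X slots of an MLX bundle (written .mil on the template list), and that each instruction goes into the first open bundle with a compatible free slot. These rules and all names are assumptions for illustration, not the compiler's actual heuristic.

```python
# Stop-free bundle templates; each letter is the kind of one slot.
TEMPLATES = ["MII", "MLX", "MMI", "MFI", "MMF", "MIB", "MBB", "BBB", "MMB", "MFB"]
SORT_ORDER = ["M", "F", "IL", "I", "A", "B"]   # M < F < IL < I < A < B


def fits(syllable, slot):
    # A-type instructions may go to either an M or an I slot.
    if syllable == "A":
        return slot in ("M", "I")
    return syllable == slot


def try_place(bundle, name, syllable):
    width = 2 if syllable == "IL" else 1       # IL needs the L and X slots
    for pos in range(3 - width + 1):
        if any(bundle["slots"][pos + k] is not None for k in range(width)):
            continue
        if syllable == "IL":
            ok = [t for t in bundle["templates"] if t[pos:pos + 2] == "LX"]
        else:
            ok = [t for t in bundle["templates"] if fits(syllable, t[pos])]
        if ok:
            for k in range(width):
                bundle["slots"][pos + k] = name
            bundle["templates"] = ok           # templates still consistent
            return True
    return False


def pack(instrs):
    """Greedy first-fit packing; unfilled slots are reported as NOP."""
    bundles = []
    for name, syllable in instrs:
        if not any(try_place(b, name, syllable) for b in bundles):
            fresh = {"slots": [None, None, None], "templates": list(TEMPLATES)}
            if not try_place(fresh, name, syllable):
                raise ValueError(f"no template accepts {name} ({syllable})")
            bundles.append(fresh)
    return [[s or "NOP" for s in b["slots"]] for b in bundles]


instrs = [("I1", "I"), ("I2", "IL"), ("I3", "F")]
by_type = sorted(instrs, key=lambda i: SORT_ORDER.index(i[1]))
print(pack(instrs))    # three bundles
print(pack(by_type))   # two bundles
```

Under this model the unsorted order needs three bundles, while the sorted order lets I1 share the .mfi bundle with I3 and needs only two.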
18
Template Selection Heuristics [chart]
19
Bypass Latency Accuracy
[Figure: r17 = add r16, 8 placed on the M-Unit or the I-Unit, feeding r18 = ld [r17] on the M-Unit, with latencies labeled 1 and 2]
- Phase ordering of functional unit assignment
  - Code selection time is too early: underutilizes resources
  - Template selection time is too late: inaccurate scheduling latencies
- Solution: assign to functional unit during scheduling
  - Assign to M-Unit if available, else
  - Assign to I-Unit and increment latency
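The proposed solution amounts to a small decision made by the scheduler each time it issues an instruction that can run on either unit type. A minimal sketch, assuming the scheduler tracks how many M and I units remain free in the current cycle and that the bypass penalty is one extra cycle (the function name and parameters are hypothetical):

```python
def assign_unit(m_units_free, i_units_free, base_latency):
    """Pick a functional unit for an instruction that fits both M and I units.

    Returns (unit, latency), or None when neither unit type is free this cycle.
    """
    if m_units_free > 0:
        # The result reaches a dependent load with the base latency.
        return "M", base_latency
    if i_units_free > 0:
        # The scheduler models the longer bypass by incrementing the latency.
        return "I", base_latency + 1
    return None


print(assign_unit(m_units_free=1, i_units_free=2, base_latency=1))  # ('M', 1)
print(assign_unit(m_units_free=0, i_units_free=2, base_latency=1))  # ('I', 2)
```

Making the choice during scheduling means the latency used for dependent instructions already reflects the unit the producer will actually run on.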
20
Modeling of Address Computation Latency [chart]
21
Other Optimizations
- Predication
  - Profitability depends on the benchmark
  - Performance variations within 2%
- Branch hints
  - Up to 50% speedup from using branch hints
- Sign-extension elimination
  - 1% potential gain for our compiler
22
Conclusions
- Light-weight optimization techniques for Itanium
- Considering the microarchitecture is important
  - Cannot ignore bypass latencies
  - Template selection should be resource sensitive
- Language semantics help to improve ILP
  - Type-based memory disambiguation
  - Exception dependency elimination