Efficient Binary Translation in Co-Designed Virtual Machines
Shiliang Hu
Thesis Defense: x86vm, Feb. 28, 2006
Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
  - DBT Modeling & Translation Strategy
  - Efficient DBT for the x86 (SW)
  - Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
The Dilemma: Binary Compatibility
Two fundamentals for computer architects:
- Computer applications: ever-expanding.
  - Software development is expensive; software porting is also costly.
  - Hence standard software binary distribution format(s).
- Implementation technology: ever-evolving.
  - Silicon technology has been evolving rapidly (Moore's Law).
  - Trend: ISA and architecture innovation is always needed.
Dilemma: binary compatibility — two conflicting trends, caused by coupling the software binary distribution format to the hardware/software interface.
Solution: Dynamic ISA Mapping
- Software in the architected ISA (e.g. x86): OS, drivers, library code, applications.
- HW implementation (processors, memory system, I/O devices) runs an implementation ISA (e.g. the fusible ISA).
- Conventional HW design: pipeline decoders map the architected ISA directly onto the pipeline.
- VM paradigm: a software binary translator performs dynamic translation into a code cache.
ISA mapping approaches:
- Hardware-intensive translation: good for startup performance.
- Software dynamic optimization: good for hotspots.
Can we combine the advantages of both?
- Startup: fast, hardware-intensive translation.
- Steady state: intelligent translation/optimization for hotspots.
Key: Efficient Binary Translation
(Figure: startup curves for Windows workloads — cumulative x86 IPC, normalized, vs. finish time in cycles — comparing Ref: Superscalar; VM: Interp & SBT; VM: BBT & SBT; VM: steady state.)
Issue: Bad-Case Scenarios
- Short-running & fine-grain cooperating tasks: performance lost to slow startup cannot be compensated for before the tasks end.
- Real-time applications: real-time constraints can be compromised by the slow translation process.
- Multi-tasking, server-like applications: frequent context switches between tasks competing for resources; limited code cache size forces re-translation as tasks are switched in and out.
- OS boot-up & shutdown (client and mobile platforms).
Related Work: State-of-the-Art
- VM pioneers: IBM System/38, AS/400.
- Products: Transmeta x86 processors (Crusoe, Efficeon) — Code Morphing Software + VLIW engine.
- Research projects: IBM DAISY, BOA — full-system translator + VLIW engine; DBT overhead of 4000+ PowerPC instructions per translated instruction.
- Other research projects: DBT for ILDP (H. Kim & J. E. Smith).
Dynamic binary translation / optimization:
- SW-based (often user mode only): UQBT, Dynamo (RIO), IA-32 EL; Java and .NET HLL VM runtime systems.
- HW-based: trace cache fill units, rePLay, PARROT, etc.
Thesis Contributions
- Efficient dynamic binary translation (DBT):
  - DBT runtime overhead modeling and translation strategy.
  - Efficient software translation algorithms.
  - Simple hardware accelerators for DBT.
- Macro-op execution microarchitecture (w/ Kim, Lipasti):
  - Higher IPC and higher clock speed potential.
  - Reduced complexity at critical pipeline stages.
- An integrated co-designed x86 virtual machine:
  - Superior steady-state performance.
  - Competitive startup performance.
  - Complexity-effective, power efficient.
Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
  - DBT Modeling & Translation Strategy
  - Efficient DBT for the x86 (SW)
  - Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
The x86vm Framework
Goal: experimental research infrastructure to explore the co-designed x86 VM paradigm.
- Our co-designed x86 virtual machine design.
- Software components: VMM.
- Hardware components: microarchitecture timing simulators; internal implementation ISA.
- Object-oriented design & implementation in C++.
The x86vm Framework
(Diagram: software in the architected ISA — OS, drivers, library code & apps — feeds x86 instructions through BOCHS 2.2 and the x86vmm DBT (VMM runtime software) into code cache(s) of macro-ops in the implementation ISA (e.g. the fusible ISA); below sits the hardware model, a microarchitecture timing simulator.)
Two-Stage DBT System
- VMM runtime: orchestrates VM system execution; runtime resource management; precise state recovery.
- DBT: BBT — basic block translator; SBT — hot superblock translator & optimizer.
Evaluation Methodology
- Reference/baseline: best-performing x86 processors; approximation to Intel Pentium M, AMD K7/K8.
- Experimental data collection: simulation; different models need different instantiations of x86vm.
- Benchmarks:
  - SPEC 2000 integer (SPEC2K). Binary generation: Intel C/C++ compiler, -O3 base optimization; test data inputs, full runs.
  - Winstone2004 business suite (WSB2004): 500-million x86 instruction traces for 10 common Windows applications.
x86 Binary Characterization
- Instruction count expansion (x86 to RISC-ops): 40%+ for SPEC2K; under 50% for Windows workloads. Increases instruction management and communication; introduces redundancy and inefficiency.
- Code footprint expansion: nearly double if cracked into 32-bit fixed-length RISC-ops; 30-40% if cracked into 16/32-bit RISC-ops. Affects fetch efficiency and memory hierarchy performance.
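The footprint-expansion bullet above can be sanity-checked with a back-of-the-envelope model. This sketch is illustrative only: the average x86 instruction length and cracking ratio below are assumed round numbers, not measurements from the thesis.

```python
# Toy model of code footprint expansion when x86 code is cracked into
# RISC-ops. All parameter values are assumptions for illustration.

def footprint_expansion(avg_x86_bytes, riscops_per_x86, avg_riscop_bytes):
    """Ratio of cracked RISC-op footprint to the original x86 footprint."""
    return (riscops_per_x86 * avg_riscop_bytes) / avg_x86_bytes

# Assume ~3.5 bytes per x86 instruction and ~1.4 RISC-ops per x86
# instruction (consistent with the 40%+ dynamic expansion quoted above).
fixed_32b = footprint_expansion(3.5, 1.4, 4.0)     # 32-bit RISC-ops only
mixed_16_32b = footprint_expansion(3.5, 1.4, 3.0)  # mixed 16/32-bit encoding

print(f"32-bit only: {fixed_32b:.2f}x, 16/32-bit mix: {mixed_16_32b:.2f}x")
```

Under these assumptions the dense 16/32-bit encoding recovers a large share of the expansion, which is the motivation for the add-on 16-bit formats in the fusible ISA.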
Overview of Baseline VM Design
Goal: demonstrate the power of the VM paradigm via a specific x86 processor design.
- Architected ISA: x86.
- Co-designed VM software: SBT, BBT, and the VM runtime control system.
- Implementation ISA: Fusible ISA (FISA).
- Efficient microarchitecture enabled: macro-op execution.
Fusible Instruction Set
RISC-ops with unique features:
- A fusible bit per instruction marks fused pairs.
- Dense encoding: 16/32-bit ISA — core 32-bit instruction formats plus add-on 16-bit formats for code density.
Special features to support the x86:
- Condition codes.
- x86 addressing modes.
- Awareness of long immediate values.
(Figure: Fusible ISA instruction formats — 32-bit formats with a fusible bit (F), 10-bit opcodes, 5-bit source/destination register designators, and 11/16/21-bit immediate/displacement fields; 16-bit formats with 5-bit opcodes.)
VMM: Virtual Machine Monitor
- Runtime controller: orchestrates VM operation, translation, translated code, etc.
- Code cache(s) & management: hold translations; translation lookup & chaining; eviction policy.
- BBT: initial emulation; straightforward cracking, no optimization.
- SBT: hotspot optimizer; fuses dependent instruction pairs into macro-ops.
- Precise state recovery routines.
Microarchitecture: Macro-op Execution
- Enhanced out-of-order superscalar microarchitecture: processes & executes fused macro-ops as single instructions throughout the entire pipeline.
- Analogy: car-pooling in all highway lanes reduces congestion at high throughput, AND raises the speed limit from 65 mph to 80 mph.
- Joint work with I. Kim & M. Lipasti.
(Figure: overview of macro-op execution — Fetch, Align/Fuse, Decode, Rename, Dispatch, Wakeup, Select, RF, EXE, WB/MEM, Retire — with a fuse bit, collapsed 3-1 ALUs, and reduced cache ports.)
Co-Designed x86 Pipeline Front-End
(Figure: macro-op formation — Fetch (16 bytes), Align/Fuse, Decode, Rename, Dispatch across slots 0-2, stages 1-6.)
Co-Designed x86 Pipeline Back-End
(Figure: macro-op execution — Wakeup, Select, Payload, RF, EXE, WB/Mem; 2-cycle macro-op scheduler; three issue lanes with dual-entry slots and 2 register read ports per lane; issue ports 0-2 feeding ALU0-ALU2 with collapsed 3-1 ALUs.)
Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
  - DBT Modeling & Translation Strategy
  - Efficient DBT for the x86 (SW)
  - Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
Performance: Memory Hierarchy Perspective
- Disk startup: initial program startup; module or task reloading after swap.
- Memory startup: long-duration context switch, phase changes; x86 code is still in memory, translated code is not.
- Code cache transient/startup: short-duration context switch, phase changes.
- Steady state: translated code is available and placed in the memory hierarchy.
Memory Startup Curves
(Figure: startup curves with the hot threshold marked.)
Hotspot Behavior: WSB2004 (100M)
(Figure: static x86 instruction execution frequency — execution frequency vs. number of static x86 instructions (×1000), with the hot threshold marked.)
Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
  - DBT Modeling & Translation Strategy
  - Efficient DBT for the x86 (SW)
  - Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
(Hotspot) Translation Procedure
1. Translation unit formation: form the hotspot superblock.
2. IR generation: crack x86 instructions into RISC-style micro-ops.
3. Machine state mapping: perform cluster analysis of embedded long immediate values and assign them to registers if necessary.
4. Dependency graph construction for the superblock.
5. Macro-op fusing algorithm: scan for dependent pairs to be fused; forward scan, backward pairing; two-pass fusing to prioritize ALU ops.
6. Register allocation: reorder fused dependent pairs together, extend live ranges for precise traps, use consistent state mapping at superblock exits.
7. Code generation into the code cache.
Macro-op Fusing Algorithm
Objectives: maximize fused dependent pairs; keep the algorithm simple & fast.
Heuristics:
- Pipelined issue logic: only single-cycle ALU ops can be a head; minimize non-fused single-cycle ALU ops.
- Criticality: fuse instructions that are "close" in the original sequence; ALU-op criticality is easier to estimate.
- Simplicity: 2 or fewer distinct register operands per fused pair.
Two-pass fusing algorithm:
- The 1st pass (forward scan) prioritizes ALU ops: for each ALU-op tail candidate, look backward in the scan for its head.
- The 2nd pass considers all kinds of RISC-ops as tail candidates.
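The two-pass scan above can be sketched as follows. This is a simplified illustration: the instruction representation (dicts with register sets) is hypothetical, and the real algorithm additionally enforces the anti-scan heuristic, dependence-cycle checks, and pairing-distance limits.

```python
# Simplified sketch of the two-pass macro-op fusing scan: forward scan over
# tail candidates, backward pairing to find each tail's head.

def fuse_pairs(ops):
    """ops: list of dicts with 'dsts'/'srcs' register-ID sets and a
    'single_cycle_alu' flag. Returns (head, tail) index pairs."""
    fused = set()
    pairs = []

    def try_pair(tail, want_alu_tail):
        op_t = ops[tail]
        if tail in fused or (want_alu_tail and not op_t['single_cycle_alu']):
            return
        # Backward pairing: nearest eligible head first (criticality).
        for head in range(tail - 1, -1, -1):
            op_h = ops[head]
            if head in fused or not op_h['single_cycle_alu']:
                continue  # only single-cycle ALU ops may head a pair
            if not (op_h['dsts'] & op_t['srcs']):
                continue  # must be a dependent pair
            # Simplicity heuristic: <= 2 distinct live-in source registers.
            if len((op_h['srcs'] | op_t['srcs']) - op_h['dsts']) <= 2:
                fused.update((head, tail))
                pairs.append((head, tail))
                return

    # Pass 1: prioritize single-cycle ALU ops as tails.
    for i in range(len(ops)):
        try_pair(i, want_alu_tail=True)
    # Pass 2: consider all remaining RISC-ops as tail candidates.
    for i in range(len(ops)):
        try_pair(i, want_alu_tail=False)
    return pairs
```

Running pass 1 first ensures ALU-ALU pairs are formed before loads and stores compete for heads, matching the "minimize non-fused single-cycle ALU ops" heuristic.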
Dependence Cycle Detection
(Figure: dependence-cycle cases; all cases are generalized to case (c) by the anti-scan fusing heuristic.)
Fusing Algorithm: Example
x86 asm:
  1. lea    eax, DS:[edi + 01]
  2. mov    [DS:080b8658], eax
  3. movzx  ebx, SS:[ebp + ecx << 1]
  4. and    eax, 0000007f
  5. mov    edx, DS:[eax + esi << 0 + 0x7c]
RISC-ops:
  1. ADD    Reax, Redi, 1
  2. ST     Reax, mem[R14]
  3. LD.zx  Rebx, mem[Rebp + Recx << 1]
  4. AND    Reax, 0000007f
  5. ADD    R11, Reax, Resi
  6. LD     Redx, mem[R11 + 0x7c]
After fusing: macro-ops
  1. ADD    R12, Redi, 1  ::  AND Reax, R12, 007f
  2. ST     R12, mem[R14]
  3. LD.zx  Rebx, mem[Rebp + Recx << 1]
  4. ADD    R11, Reax, Resi  ::  LD Redx, mem[R11 + 0x7c]
Instruction Fusing Profile (SPEC2K)
(Figure: percentage of dynamic instructions — fused, LD, ST, BR, FP or NOPs, ALU — for the SPEC2K integer benchmarks and their average.)
Instruction Fusing Profile (WSB2004)
(Figure.)
Macro-op Fusing Profile
Of all fused macro-ops (SPEC / WSB):
- 52% / 43% are ALU-ALU pairs.
- 30% / 35% are fused condition-test & conditional-branch pairs.
- 18% / 22% are ALU-MEM op pairs.
Also, of all fused macro-ops:
- 70+% are inter-x86-instruction fusions.
- 46% access two distinct source registers.
- 15% (i.e. 6% of all instruction entities) write two distinct destination registers.
DBT Software Runtime Overhead Profile
- Software BBT profile — BBT overhead ΔBBT: about 105 FISA instructions (85 cycles) per translated x86 instruction, mostly for decoding and cracking.
- Software SBT profile — SBT overhead ΔSBT: about 1000+ instructions per translated hotspot instruction.
Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
  - DBT Modeling & Translation Strategy
  - Efficient DBT for the x86 (SW)
  - Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
Principles for Hardware Assist Design
Goals:
- Reduce VMM runtime software overhead significantly.
- Maintain HW complexity-effectiveness & power efficiency.
- HW & SW simplify each other: synergetic co-design.
Observations (analytic modeling & simulation):
- High-performance startup (non-hotspot) is critical.
- Hotspot code is usually a small fraction of the overall footprint.
Approach: BBT accelerators —
- Front-end: dual-mode decoders.
- Back-end: HW assist functional unit(s).
Dual-Mode CISC (x86) Decoders
Basic idea: a 2-stage decoder for a CISC ISA (first published in Motorola 68K processor papers) breaks a monolithic complex decoder into two separate, simpler decoder stages.
Dual-mode CISC decoders:
- CISC (x86) instructions pass through both stages: x86 → μ-ops, then μ-op decode into opcode/operand designators and other pipeline control signals.
- Internal fusible RISC-ops pass through only the second stage.
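The dual-mode idea can be illustrated with a toy model. The cracking table below is a hypothetical stand-in for the first decode stage, not a real x86 decoder; the point is only that native RISC-ops skip stage 1.

```python
# Conceptual sketch of two-stage, dual-mode decode: x86 instructions pass
# both stages; internal RISC-ops bypass stage 1. The cracking table is a
# toy example, not real x86 decode logic.

CRACK_TABLE = {
    # hypothetical cracking of a memory-operand x86 add into micro-ops
    'add [mem], reg': ['ld tmp, [mem]', 'add tmp, tmp, reg', 'st tmp, [mem]'],
    'mov reg, reg':   ['mov reg, reg'],
}

def decode(insn, mode):
    """mode='x86': pass both stages; mode='native': skip stage 1."""
    if mode == 'x86':
        micro_ops = CRACK_TABLE[insn]        # stage 1: crack CISC -> micro-ops
    else:
        micro_ops = [insn]                   # native RISC-op, already cracked
    return [('ctrl', u) for u in micro_ops]  # stage 2: micro-op -> control
```

In hardware the second stage is the same logic in both modes, which is what makes the smooth transition from a conventional superscalar design possible.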
Dual-Mode CISC (x86) Decoders
Advantages:
- High-performance startup, similar to a conventional superscalar design.
- No code cache needed for non-hotspot code.
- Smooth transition from a conventional superscalar design.
Disadvantages:
- Complexity: an n-wide machine needs n such decoders — but so does a conventional design.
- Less power efficient than other VM schemes.
Hardware Assists as Functional Units
(Figure: pipeline — Fetch, Align/Fuse, Decode, Rename, Dispatch into GP and FP/MM issue queues; functional unit(s) to assist the VMM sit alongside the F-ADD, F-MUL/DIV and other long-latency units and the LD/ST units; 128-bit × 32 FP/MM register file.)
Hardware Assists as Functional Units
Advantages:
- High-performance startup.
- Power efficient.
- Programmability and flexibility.
- Simplicity: only one simplified decoder needed.
Disadvantages:
- Runtime overhead: reduces ΔBBT from 85 to about 20 cycles, but still involves some translation overhead.
- Memory space overhead: some extra code cache space for cold code.
- Must use the VM paradigm: riskier than the dual-mode decoder.
Machine Startup Models
- Ref: Superscalar — conventional processor design as the baseline.
- VM.soft — software BBT and hotspot optimization via SBT; state-of-the-art VM design.
- VM.be — BBT accelerated by back-end functional unit(s).
- VM.fe — dual-mode decoder at the pipeline front-end.
Startup Evaluation: Hardware Assists
(Figure: cumulative x86 IPC, normalized, vs. finish time in cycles for Ref: Superscalar, VM.soft, VM.be, VM.fe, and VM.steady-state.)
Activity of HW Assists
(Figure.)
Related Work on HW Assists for DBT
HW assists for profiling:
- Profile buffer [Conte '96]; BBB & hotspot detector [Merten '99]; programmable HW path profiler [CGO '05], etc.
- Profiling co-processor [Zilles '01]; many others.
HW assists for general VM technology:
- System VM assists: Intel VT, AMD Pacifica.
- Transmeta Efficeon processor: a new execute instruction to accelerate the interpreter.
HW assists for translation:
- Trace cache fill unit [Friendly '98].
- Customized buffers and instructions: rePLay, PARROT.
- Instruction path co-processor [Zhou '00], etc.
Outline
- Introduction
- The x86vm Framework
- Efficient Dynamic Binary Translation
  - DBT Modeling & Translation Strategy
  - Efficient DBT for the x86 (SW)
  - Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
Co-Designed x86 Processor Architecture
(Diagram: x86 code in the memory hierarchy is mapped two ways — vertically by a hardware x86 decoder, and horizontally by VM translation/optimization software into a macro-op code cache; both paths feed the rename/dispatch stage, issue buffer, and pipeline execution back-end through the micro-/macro-op decoder and I-$.)
Co-designed virtual machine paradigm:
- Startup: simple hardware decode/crack for fast translation.
- Steady state: dynamic software translation/optimization for hotspots.
Macro-Op Pipeline for Efficient Hotspot Execution
(Figure: x86 pipeline — Fetch, Align, x86 Decode1-3, Rename, Dispatch, Wakeup, Select, Payload, RF, EXE, WB, Retire — vs. the macro-op pipeline — Fetch, Align/Fuse, Decode, Rename, Dispatch, pipelined 2-cycle issue logic, Payload, RF, EXE, WB, Retire.)
- Executes macro-ops; higher IPC and higher clock speed potential.
- Reduced instruction traffic throughout; reduced operand forwarding; pipelined scheduler.
- Shorter pipeline front-end.
Performance Simulation Configuration
(Table.)
Processor Evaluation: SPEC2K
(Figure.)
Processor Evaluation: WSB2004
(Figure.)
Performance Contributors
Many factors contribute to the IPC improvement:
- Code straightening.
- Macro-op fusing and execution.
- Shortened pipeline front-end (reduced branch penalty).
- Collapsed 3-1 ALUs (resolve branches & addresses sooner).
Besides the baseline and macro-op models, we model three intermediate configurations:
- M0: baseline + code cache.
- M1: M0 + macro-op fusing.
- M2: M1 + shorter pipeline front-end (macro-op mode).
- Macro-op: M2 + collapsed 3-1 ALUs.
Performance Contributors: SPEC2K
(Figure: IPC contributors.)
Performance Contributors: WSB2004
(Figure: IPC contributors.)
Outline
- Introduction
- The x86vm Framework
- Efficient Dynamic Binary Translation
  - DBT Modeling & Translation Strategy
  - Efficient DBT for the x86 (SW)
  - Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
Conclusions
The co-designed virtual machine paradigm:
- Capability: binary compatibility; functionality integration; dynamic functionality upgrades.
- Performance: enables novel, efficient architectures; superior steady state, competitive startup.
- Complexity effectiveness: more flexibility for processor design, etc.
- Power efficiency.
Conclusions
The co-designed x86 processor:
- Capability: full x86 compatibility.
- Performance and power efficiency: the macro-op execution engine + DBT are an efficient combination.
- Complexity effectiveness: the VMM's DBT software removes considerable HW complexity; the microarchitecture can reduce pipeline width, with a simplified 2-cycle pipelined scheduler and a simplified operand forwarding network.
Finale: Questions & Answers
Suggestions and comments are welcome. Thank you!
Acknowledgements
- Prof. James E. Smith (advisor)
- Prof. Mikko H. Lipasti
- Dr. Ho-Seop Kim
- Dr. Ilhyun Kim
- Mr. Wooseok Chang
- Wisconsin Architecture Group
- Funding: NSF, IBM & Intel
- My family
Objectives for Computer Architects
Efficient designs for computer systems: more benefits, less cost. Specifically, three dimensions:
- Capability: practically all software can run.
- High performance: higher IPC, higher clock speeds.
- Simplicity / complexity-effectiveness: lower cost, more reliable; also low power consumption.
VM Software Modes
- VM mode (VM runtime): initial emulation of the architected ISA binary; DBT translation; exception handling.
- Translated native mode: hotspot code executed from the code cache, chained together.
Fuse Macro-ops: An Illustrative Example
(Figure.)
Why DBT Modeling?
- Understand translation overhead dynamics: profiling only gives samples.
- Translation strategy to reduce runtime overhead: HW/SW division and collaboration; translation stages (BBT, SBT) and trigger mechanisms.
- Hot threshold setting: too low, excessive false positives (translation time); too high, lost performance benefits (execution time).
Modeling via a Memory Hierarchy Perspective
(Figure.)
Analytic Modeling: DBT Overhead
(Figure: performance vs. translation complexity — x86 code is translated by BBT into the BBT code cache at cost MBBT × ΔBBT, and by SBT into the SBT code cache at cost MSBT × ΔSBT, relative to Ref: Superscalar.)
Analytic Modeling: Hot Threshold
(Figure: translation overhead vs. execution frequency — the break-even point, where performance speedup recoups translation overhead, determines the hot threshold.)
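The break-even intuition in the figure can be written down directly: hotspot translation pays off once a region's remaining executions save more cycles than the translation cost ΔSBT. The cycle costs below are hypothetical placeholders, not thesis measurements.

```python
# Break-even model for the hot threshold: SBT translation of a region is
# worthwhile once its execution count exceeds delta_sbt / per-execution gain.
# All numeric values here are illustrative assumptions.

def break_even_count(delta_sbt, cycles_per_exec_bbt, cycles_per_exec_sbt):
    """Executions after which SBT translation recoups its overhead."""
    gain = cycles_per_exec_bbt - cycles_per_exec_sbt  # cycles saved per run
    return delta_sbt / gain

# e.g. ~1000 cycles of SBT overhead per instruction; BBT-translated code at
# 1.0 cycles per execution vs. 0.7 for optimized hotspot code.
threshold = break_even_count(1000, 1.0, 0.7)
print(f"break-even after ~{threshold:.0f} executions")
```

This is why an excessively low hot threshold hurts: regions that never reach the break-even count pay the full ΔSBT without recouping it.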
Modeling Evaluation
(Figures.)
Model Implications
Translation overhead:
- Mainly caused by initial emulation.
- Hotspot translation overhead is not significant (and can always be beneficial) if no translation is performed for false-positive hotspots.
Hotspot detection:
- The model explains many empirical threshold settings in the literature.
- Excessively low thresholds lead to excessive false-positive hotspots, and thus excessive hotspot optimization overhead.
Efficient Dynamic Binary Translation
Efficient binary translation strategy:
- Adaptive runtime optimization.
- Efficient algorithms for translation & optimization.
- Cooperation with hardware accelerators.
Efficient translated native code:
- Generic optimizations: native register mapping & register allocation for long immediate values.
- Implementation ISA optimizations: enable new microarchitecture execution — macro-op fusing.
- Architected ISA optimizations: runtime x86-specific optimizations.
Register Mapping & Long Immediate (LIMM) Conversion
Objectives:
- Efficient emulation of the x86 ISA on the fusible ISA.
- 32-bit long immediate values embedded in x86 binaries are problematic for a 32-bit instruction set: remove them.
Solution — register mapping:
- Map all x86 GPRs to the lower 16 R registers in the fusible ISA.
- Map all x86 FP/multimedia registers to the lower 24 F registers in the fusible ISA.
Solution — long immediate conversion:
- Scan the superblock for all long immediate values.
- Perform value clustering analysis and allocate registers to frequent long immediate values.
- Convert some x86-embedded long immediate values into a register access, or a register plus a short immediate that the implementation ISA can handle.
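The conversion pass above can be sketched as follows. The function names, the short-immediate range, and the rewrite forms are assumptions for illustration; the real pass works on the superblock IR and its register allocator.

```python
# Sketch of long-immediate conversion: collect the 32-bit immediates in a
# superblock, give registers to the most frequent values, and rewrite the
# rest as register + short immediate when the delta fits. The short-imm
# range and rewrite encodings are assumed, not taken from the FISA spec.
from collections import Counter

SHORT_IMM_MAX = 1 << 10   # assumed short-immediate range of the fusible ISA

def allocate_limm_regs(immediates, num_regs):
    """Value clustering (here: simple frequency count) picks which long
    immediates get to live in registers."""
    return [v for v, _ in Counter(immediates).most_common(num_regs)]

def rewrite(imm, reg_values):
    """Return ('reg', v), ('reg+imm', base, delta), or ('inline', imm)."""
    if imm in reg_values:
        return ('reg', imm)
    for base in reg_values:
        if abs(imm - base) < SHORT_IMM_MAX:
            return ('reg+imm', base, imm - base)
    return ('inline', imm)   # must still be materialized the long way
```

Clustering matters because nearby addresses (e.g. fields of one global data structure) can all be reached from a single allocated base register.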
Code Reordering Algorithm
Problem: a new code scheduling algorithm is needed to group dependent instructions together — modern compilers schedule independent instructions apart, not dependent pairs together.
Idea: partition the instructions around a fused pair into PreSet, MidSet & PostSet:
- PreSet: all instructions that must be moved before the head.
- MidSet: all instructions in the middle, between the head and the tail.
- PostSet: all instructions that can be moved after the tail.
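A minimal sketch of this partition for a single (head, tail) pair, assuming dependence edges are given explicitly and run forward in program order. The real pass derives edges from the dependence graph, checks pair feasibility, and must also respect memory ordering, all of which this toy version omits.

```python
# Partition the instructions between a fused pair's head and tail:
# instructions the tail (transitively) depends on must move before the
# head (PreSet); the rest can move after the tail (PostSet), making the
# head and tail adjacent. deps must be forward edges (producer < consumer).

def partition(n, head, tail, deps):
    """deps: set of (producer, consumer) index pairs.
    Returns (pre_set, post_set) for the instructions strictly between
    head and tail."""
    # Transitive producers of each instruction, built in consumer order.
    producers = {i: set() for i in range(n)}
    for p, c in sorted(deps, key=lambda e: e[1]):
        producers[c] |= producers[p] | {p}
    pre, post = [], []
    for i in range(head + 1, tail):
        (pre if i in producers[tail] else post).append(i)
    return pre, post
```

On the six RISC-ops of the reordering example that follows, pairing the first ADD with the AND leaves the ST and LD between them free to move after the pair, which is exactly how the fused macro-op sequence is laid out.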
Code Reordering Example
In: RISC-ops with fusing info; out: macro-op sequence.
  1. ADD  R12, Redi, 1
  2. ST   R12, mem[R14]
  3. LD   Rebx, mem[Rebp + Recx]
  4. AND  Reax, R12, 007f
  5. ADD  R11, Reax, Resi
  6. LD   Redx, mem[R11 + 7c]
BBT Overhead Profile
- About 100 instructions of overhead per translated x86 instruction — comparable in cost to a main memory access.
- Distributed evenly among fetch/decode, semantics routines, the translation loop, and x86 cracking.
(Figure.)
Hotspot Translation Back-End Profile
- Light-weight optimizations (ProcLongImm, DDG setup, encoding): tens of instructions each.
- Heavy-weight optimizations (μ-op translation, fusing, code generation): none dominates.
- Overhead per x86 instruction is comparable to an initial load from disk.
(Figure.)
HW-Assisted DBT Overhead (100M)
(Figure.)
Breakeven Points for Individual Benchmarks
(Figure.)
Hotspots Detected vs. Runs
(Figure.)
Hotspot Coverage vs. Threshold
(Figure.)
Hotspot Coverage vs. Runs
(Figure.)
DBT Complexity/Overhead Trade-off
(Figure.)
Performance Evaluation: SPEC2000
(Figure: IPC trend.)
Co-Designed x86 Processor: Assessment
Architecture enhancement:
- The hardware/software co-designed paradigm enables novel designs & more desirable system features.
- Fusing dependent instruction pairs collapses the dataflow graph, reducing instruction management and inter-instruction communication.
Complexity effectiveness:
- Pipelined 2-cycle instruction scheduler.
- Significantly reduced ALU value forwarding network.
- DBT software removes a lot of hardware complexity.
Power consumption implications:
- Reduced pipeline width.
- Reduced inter-instruction communication and instruction management.
Processor Design Challenges
CISC challenges — suboptimal internal micro-ops:
- Complex decoders & obsolete features/instructions.
- Instruction count expansion: management, communication.
- Redundancy & inefficiency in the cracked micro-ops.
- Solution: dynamic optimization.
Other current challenges (CISC & RISC):
- Efficiency: nowadays, less performance gain per transistor.
- Power consumption has become acute.
- Solution: novel, efficient microarchitectures.
Superscalar x86 Pipeline Challenges
(Pipeline: Fetch, Align, x86 Decode1-3, Rename, Dispatch, atomic scheduler (wake-up & select), RF, EXE, WB, Retire.)
These are the best-performing x86 processors, BUT several pipeline stages are critical:
- Branch behavior limits fetch bandwidth.
- Complex x86 decoders sit at the pipeline front-end.
- Complex issue logic: wake-up and select in the same cycle.
- Complex operand forwarding networks: wire delays.
- Instruction count expansion: high pressure on instruction management and communication mechanisms.
Related Work: x86 Processors
- AMD K7/K8 microarchitecture: macro-operations; high-performance, efficient pipeline.
- Intel Pentium M: micro-op fusion; stack manager; high performance, low power.
- Transmeta x86 processors: co-designed x86 VM; VLIW engine + code morphing software.
Future Directions
Co-designed virtual machine technology:
- Confidence: more realistic, exhaustive benchmark studies — important for whole-workload behavior.
- Enhancement: more synergetic, complexity-effective HW/SW co-design techniques.
- Application: specific enabling techniques for specific novel computer architectures of the future.
Example co-designed x86 processor design:
- Confidence study as above.
- Enhancement (HW μ-arch): reduce register write ports.
- Enhancement (VMM): more dynamic optimizations in SBT, e.g. CSE, a software stack manager, SIMDification.