Efficient Binary Translation in Co-Designed Virtual Machines
Shiliang Hu
Thesis Defense: x86vm, Feb. 28, 2006
Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
  - DBT Modeling & Translation Strategy
  - Efficient DBT for the x86 (SW)
  - Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
The Dilemma: Binary Compatibility
Two fundamentals for computer architects:
- Computer applications: ever-expanding.
  - Software development is expensive; software porting is also costly.
  - Hence standard software binary distribution format(s).
- Implementation technology: ever-evolving.
  - Silicon technology has been evolving rapidly (Moore's Law).
  - Trend: ISA and architecture innovation is always needed.
Dilemma: binary compatibility — two conflicting trends, caused by coupling the software binary distribution format to the hardware/software interface.
Solution: Dynamic ISA Mapping
- Software in the architected ISA (e.g. x86): OS, drivers, library code, applications.
- HW implementation (processors, memory system, I/O devices) runs an implementation ISA (e.g. the fusible ISA).
- Conventional HW design: pipeline decoders map the architected ISA directly onto the pipeline.
- VM paradigm: a software binary translator performs dynamic translation into a code cache.
ISA mapping approaches:
- Hardware-intensive translation: good for startup performance.
- Software dynamic optimization: good for hotspots.
Can we combine the advantages of both?
- Startup: fast, hardware-intensive translation.
- Steady state: intelligent translation/optimization for hotspots.
Key: Efficient Binary Translation
(Figure: startup curves for Windows workloads — cumulative x86 IPC, normalized, vs. finish time in cycles — comparing Ref: Superscalar; VM: Interp & SBT; VM: BBT & SBT; VM: steady state.)
Issue: Bad-Case Scenarios
- Short-running & fine-grain cooperating tasks: performance lost to slow startup cannot be compensated for before the tasks end.
- Real-time applications: real-time constraints can be compromised by the slow translation process.
- Multi-tasking, server-like applications: frequent context switches between tasks competing for resources; limited code cache size forces re-translation as tasks are switched in and out.
- OS boot-up & shutdown (client and mobile platforms).
Related Work: State-of-the-Art
- VM pioneers: IBM System/38, AS/400.
- Products: Transmeta x86 processors (Crusoe, Efficeon) — Code Morphing Software + VLIW engine.
- Research projects: IBM DAISY, BOA — full-system translator + VLIW engine; DBT overhead of 4000+ PowerPC instructions per translated instruction.
- Other research projects: DBT for ILDP (H. Kim & J. E. Smith).
Dynamic binary translation / optimization:
- SW-based (often user mode only): UQBT, Dynamo (RIO), IA-32 EL; Java and .NET HLL VM runtime systems.
- HW-based: trace cache fill units, rePLay, PARROT, etc.
Thesis Contributions
- Efficient dynamic binary translation (DBT):
  - DBT runtime overhead modeling and translation strategy.
  - Efficient software translation algorithms.
  - Simple hardware accelerators for DBT.
- Macro-op execution microarchitecture (w/ Kim, Lipasti):
  - Higher IPC and higher clock speed potential.
  - Reduced complexity at critical pipeline stages.
- An integrated co-designed x86 virtual machine:
  - Superior steady-state performance.
  - Competitive startup performance.
  - Complexity-effective, power efficient.
Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
  - DBT Modeling & Translation Strategy
  - Efficient DBT for the x86 (SW)
  - Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
The x86vm Framework
Goal: experimental research infrastructure to explore the co-designed x86 VM paradigm.
- Our co-designed x86 virtual machine design.
- Software components: VMM.
- Hardware components: microarchitecture timing simulators; internal implementation ISA.
- Object-oriented design & implementation in C++.
The x86vm Framework
(Diagram: software in the architected ISA — OS, drivers, library code & apps — feeds x86 instructions through BOCHS 2.2 and the x86vmm DBT (VMM runtime software) into code cache(s) of macro-ops in the implementation ISA (e.g. the fusible ISA); below sits the hardware model, a microarchitecture timing simulator.)
Two-Stage DBT System
- VMM runtime: orchestrates VM system execution; runtime resource management; precise state recovery.
- DBT: BBT — basic block translator; SBT — hot superblock translator & optimizer.
Evaluation Methodology
- Reference/baseline: best-performing x86 processors; approximation to Intel Pentium M, AMD K7/K8.
- Experimental data collection: simulation; different models need different instantiations of x86vm.
- Benchmarks:
  - SPEC 2000 integer (SPEC2K). Binary generation: Intel C/C++ compiler, -O3 base optimization; test data inputs, full runs.
  - Winstone2004 business suite (WSB2004): 500-million x86 instruction traces for 10 common Windows applications.
x86 Binary Characterization
- Instruction count expansion (x86 to RISC-ops): 40%+ for SPEC2K; under 50% for Windows workloads. Increases instruction management and communication; introduces redundancy and inefficiency.
- Code footprint expansion: nearly double if cracked into 32-bit fixed-length RISC-ops; 30-40% if cracked into 16/32-bit RISC-ops. Affects fetch efficiency and memory hierarchy performance.
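The footprint-expansion bullet above can be sanity-checked with a back-of-the-envelope model. This sketch is illustrative only: the average x86 instruction length and cracking ratio below are assumed round numbers, not measurements from the thesis.

```python
# Toy model of code footprint expansion when x86 code is cracked into
# RISC-ops. All parameter values are assumptions for illustration.

def footprint_expansion(avg_x86_bytes, riscops_per_x86, avg_riscop_bytes):
    """Ratio of cracked RISC-op footprint to the original x86 footprint."""
    return (riscops_per_x86 * avg_riscop_bytes) / avg_x86_bytes

# Assume ~3.5 bytes per x86 instruction and ~1.4 RISC-ops per x86
# instruction (consistent with the 40%+ dynamic expansion quoted above).
fixed_32b = footprint_expansion(3.5, 1.4, 4.0)     # 32-bit RISC-ops only
mixed_16_32b = footprint_expansion(3.5, 1.4, 3.0)  # mixed 16/32-bit encoding

print(f"32-bit only: {fixed_32b:.2f}x, 16/32-bit mix: {mixed_16_32b:.2f}x")
```

Under these assumptions the dense 16/32-bit encoding recovers a large share of the expansion, which is the motivation for the add-on 16-bit formats in the fusible ISA.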
Overview of Baseline VM Design
Goal: demonstrate the power of the VM paradigm via a specific x86 processor design.
- Architected ISA: x86.
- Co-designed VM software: SBT, BBT, and the VM runtime control system.
- Implementation ISA: Fusible ISA (FISA).
- Efficient microarchitecture enabled: macro-op execution.
Fusible Instruction Set
RISC-ops with unique features:
- A fusible bit per instruction marks fused pairs.
- Dense encoding: 16/32-bit ISA — core 32-bit instruction formats plus add-on 16-bit formats for code density.
Special features to support the x86:
- Condition codes.
- x86 addressing modes.
- Awareness of long immediate values.
(Figure: Fusible ISA instruction formats — 32-bit formats with a fusible bit (F), 10-bit opcodes, 5-bit source/destination register designators, and 11/16/21-bit immediate/displacement fields; 16-bit formats with 5-bit opcodes.)
VMM: Virtual Machine Monitor
- Runtime controller: orchestrates VM operation, translation, translated code, etc.
- Code cache(s) & management: hold translations; translation lookup & chaining; eviction policy.
- BBT: initial emulation; straightforward cracking, no optimization.
- SBT: hotspot optimizer; fuses dependent instruction pairs into macro-ops.
- Precise state recovery routines.
Microarchitecture: Macro-op Execution
- Enhanced out-of-order superscalar microarchitecture: processes & executes fused macro-ops as single instructions throughout the entire pipeline.
- Analogy: car-pooling in all highway lanes reduces congestion at high throughput, AND raises the speed limit from 65 mph to 80 mph.
- Joint work with I. Kim & M. Lipasti.
(Figure: overview of macro-op execution — Fetch, Align/Fuse, Decode, Rename, Dispatch, Wakeup, Select, RF, EXE, WB/MEM, Retire — with a fuse bit, collapsed 3-1 ALUs, and reduced cache ports.)
Co-Designed x86 Pipeline Front-End
(Figure: macro-op formation — Fetch (16 bytes), Align/Fuse, Decode, Rename, Dispatch across slots 0-2, stages 1-6.)
Co-Designed x86 Pipeline Back-End
(Figure: macro-op execution — Wakeup, Select, Payload, RF, EXE, WB/Mem; 2-cycle macro-op scheduler; three issue lanes with dual-entry slots and 2 register read ports per lane; issue ports 0-2 feeding ALU0-ALU2 with collapsed 3-1 ALUs.)
Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
  - DBT Modeling & Translation Strategy
  - Efficient DBT for the x86 (SW)
  - Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
Performance: Memory Hierarchy Perspective
- Disk startup: initial program startup; module or task reloading after swap.
- Memory startup: long-duration context switch, phase changes; x86 code is still in memory, translated code is not.
- Code cache transient/startup: short-duration context switch, phase changes.
- Steady state: translated code is available and placed in the memory hierarchy.
Memory Startup Curves
(Figure: startup curves with the hot threshold marked.)
Hotspot Behavior: WSB2004 (100M)
(Figure: static x86 instruction execution frequency — execution frequency vs. number of static x86 instructions (×1000), with the hot threshold marked.)
Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
  - DBT Modeling & Translation Strategy
  - Efficient DBT for the x86 (SW)
  - Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
(Hotspot) Translation Procedure
1. Translation unit formation: form the hotspot superblock.
2. IR generation: crack x86 instructions into RISC-style micro-ops.
3. Machine state mapping: perform cluster analysis of embedded long immediate values and assign them to registers if necessary.
4. Dependency graph construction for the superblock.
5. Macro-op fusing algorithm: scan for dependent pairs to be fused; forward scan, backward pairing; two-pass fusing to prioritize ALU ops.
6. Register allocation: reorder fused dependent pairs together, extend live ranges for precise traps, use consistent state mapping at superblock exits.
7. Code generation into the code cache.
Macro-op Fusing Algorithm
Objectives: maximize fused dependent pairs; keep the algorithm simple & fast.
Heuristics:
- Pipelined issue logic: only single-cycle ALU ops can be a head; minimize non-fused single-cycle ALU ops.
- Criticality: fuse instructions that are "close" in the original sequence; ALU-op criticality is easier to estimate.
- Simplicity: 2 or fewer distinct register operands per fused pair.
Two-pass fusing algorithm:
- The 1st pass (forward scan) prioritizes ALU ops: for each ALU-op tail candidate, look backward in the scan for its head.
- The 2nd pass considers all kinds of RISC-ops as tail candidates.
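The two-pass scan above can be sketched as follows. This is a simplified illustration: the instruction representation (dicts with register sets) is hypothetical, and the real algorithm additionally enforces the anti-scan heuristic, dependence-cycle checks, and pairing-distance limits.

```python
# Simplified sketch of the two-pass macro-op fusing scan: forward scan over
# tail candidates, backward pairing to find each tail's head.

def fuse_pairs(ops):
    """ops: list of dicts with 'dsts'/'srcs' register-ID sets and a
    'single_cycle_alu' flag. Returns (head, tail) index pairs."""
    fused = set()
    pairs = []

    def try_pair(tail, want_alu_tail):
        op_t = ops[tail]
        if tail in fused or (want_alu_tail and not op_t['single_cycle_alu']):
            return
        # Backward pairing: nearest eligible head first (criticality).
        for head in range(tail - 1, -1, -1):
            op_h = ops[head]
            if head in fused or not op_h['single_cycle_alu']:
                continue  # only single-cycle ALU ops may head a pair
            if not (op_h['dsts'] & op_t['srcs']):
                continue  # must be a dependent pair
            # Simplicity heuristic: <= 2 distinct live-in source registers.
            if len((op_h['srcs'] | op_t['srcs']) - op_h['dsts']) <= 2:
                fused.update((head, tail))
                pairs.append((head, tail))
                return

    # Pass 1: prioritize single-cycle ALU ops as tails.
    for i in range(len(ops)):
        try_pair(i, want_alu_tail=True)
    # Pass 2: consider all remaining RISC-ops as tail candidates.
    for i in range(len(ops)):
        try_pair(i, want_alu_tail=False)
    return pairs
```

Running pass 1 first ensures ALU-ALU pairs are formed before loads and stores compete for heads, matching the "minimize non-fused single-cycle ALU ops" heuristic.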
Dependence Cycle Detection
(Figure: dependence-cycle cases; all cases are generalized to case (c) by the anti-scan fusing heuristic.)
Fusing Algorithm: Example
x86 asm:
  1. lea    eax, DS:[edi + 01]
  2. mov    [DS:080b8658], eax
  3. movzx  ebx, SS:[ebp + ecx << 1]
  4. and    eax, 0000007f
  5. mov    edx, DS:[eax + esi << 0 + 0x7c]
RISC-ops:
  1. ADD    Reax, Redi, 1
  2. ST     Reax, mem[R14]
  3. LD.zx  Rebx, mem[Rebp + Recx << 1]
  4. AND    Reax, 0000007f
  5. ADD    R11, Reax, Resi
  6. LD     Redx, mem[R11 + 0x7c]
After fusing: macro-ops
  1. ADD    R12, Redi, 1  ::  AND Reax, R12, 007f
  2. ST     R12, mem[R14]
  3. LD.zx  Rebx, mem[Rebp + Recx << 1]
  4. ADD    R11, Reax, Resi  ::  LD Redx, mem[R11 + 0x7c]
Instruction Fusing Profile (SPEC2K)
(Figure: percentage of dynamic instructions — fused, LD, ST, BR, FP or NOPs, ALU — for the SPEC2K integer benchmarks and their average.)
Instruction Fusing Profile (WSB2004)
(Figure.)
Macro-op Fusing Profile
Of all fused macro-ops (SPEC / WSB):
- 52% / 43% are ALU-ALU pairs.
- 30% / 35% are fused condition-test & conditional-branch pairs.
- 18% / 22% are ALU-MEM op pairs.
Also, of all fused macro-ops:
- 70+% are inter-x86-instruction fusions.
- 46% access two distinct source registers.
- 15% (i.e. 6% of all instruction entities) write two distinct destination registers.
DBT Software Runtime Overhead Profile
- Software BBT profile — BBT overhead ΔBBT: about 105 FISA instructions (85 cycles) per translated x86 instruction, mostly for decoding and cracking.
- Software SBT profile — SBT overhead ΔSBT: about 1000+ instructions per translated hotspot instruction.
Outline
- Introduction
- The x86vm Infrastructure
- Efficient Dynamic Binary Translation
  - DBT Modeling & Translation Strategy
  - Efficient DBT for the x86 (SW)
  - Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
Principles for Hardware Assist Design
Goals:
- Reduce VMM runtime software overhead significantly.
- Maintain HW complexity-effectiveness & power efficiency.
- HW & SW simplify each other: synergetic co-design.
Observations (analytic modeling & simulation):
- High-performance startup (non-hotspot) is critical.
- Hotspot code is usually a small fraction of the overall footprint.
Approach: BBT accelerators —
- Front-end: dual-mode decoders.
- Back-end: HW assist functional unit(s).
Dual-Mode CISC (x86) Decoders
Basic idea: a 2-stage decoder for a CISC ISA (first published in Motorola 68K processor papers) breaks a monolithic complex decoder into two separate, simpler decoder stages.
Dual-mode CISC decoders:
- CISC (x86) instructions pass through both stages: x86 → μ-ops, then μ-op decode into opcode/operand designators and other pipeline control signals.
- Internal fusible RISC-ops pass through only the second stage.
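The dual-mode idea can be illustrated with a toy model. The cracking table below is a hypothetical stand-in for the first decode stage, not a real x86 decoder; the point is only that native RISC-ops skip stage 1.

```python
# Conceptual sketch of two-stage, dual-mode decode: x86 instructions pass
# both stages; internal RISC-ops bypass stage 1. The cracking table is a
# toy example, not real x86 decode logic.

CRACK_TABLE = {
    # hypothetical cracking of a memory-operand x86 add into micro-ops
    'add [mem], reg': ['ld tmp, [mem]', 'add tmp, tmp, reg', 'st tmp, [mem]'],
    'mov reg, reg':   ['mov reg, reg'],
}

def decode(insn, mode):
    """mode='x86': pass both stages; mode='native': skip stage 1."""
    if mode == 'x86':
        micro_ops = CRACK_TABLE[insn]        # stage 1: crack CISC -> micro-ops
    else:
        micro_ops = [insn]                   # native RISC-op, already cracked
    return [('ctrl', u) for u in micro_ops]  # stage 2: micro-op -> control
```

In hardware the second stage is the same logic in both modes, which is what makes the smooth transition from a conventional superscalar design possible.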
Dual-Mode CISC (x86) Decoders
Advantages:
- High-performance startup, similar to a conventional superscalar design.
- No code cache needed for non-hotspot code.
- Smooth transition from a conventional superscalar design.
Disadvantages:
- Complexity: an n-wide machine needs n such decoders — but so does a conventional design.
- Less power efficient than other VM schemes.
Hardware Assists as Functional Units
(Figure: pipeline — Fetch, Align/Fuse, Decode, Rename, Dispatch into GP and FP/MM issue queues; functional unit(s) to assist the VMM sit alongside the F-ADD, F-MUL/DIV and other long-latency units and the LD/ST units; 128-bit × 32 FP/MM register file.)
Hardware Assists as Functional Units
Advantages:
- High-performance startup.
- Power efficient.
- Programmability and flexibility.
- Simplicity: only one simplified decoder needed.
Disadvantages:
- Runtime overhead: reduces ΔBBT from 85 to about 20 cycles, but still involves some translation overhead.
- Memory space overhead: some extra code cache space for cold code.
- Must use the VM paradigm: riskier than the dual-mode decoder.
Machine Startup Models
- Ref: Superscalar — conventional processor design as the baseline.
- VM.soft — software BBT and hotspot optimization via SBT; state-of-the-art VM design.
- VM.be — BBT accelerated by back-end functional unit(s).
- VM.fe — dual-mode decoder at the pipeline front-end.
Startup Evaluation: Hardware Assists
(Figure: cumulative x86 IPC, normalized, vs. finish time in cycles for Ref: Superscalar, VM.soft, VM.be, VM.fe, and VM.steady-state.)
Activity of HW Assists
(Figure.)
Related Work on HW Assists for DBT
HW assists for profiling:
- Profile buffer [Conte '96]; BBB & hotspot detector [Merten '99]; programmable HW path profiler [CGO '05], etc.
- Profiling co-processor [Zilles '01]; many others.
HW assists for general VM technology:
- System VM assists: Intel VT, AMD Pacifica.
- Transmeta Efficeon processor: a new execute instruction to accelerate the interpreter.
HW assists for translation:
- Trace cache fill unit [Friendly '98].
- Customized buffers and instructions: rePLay, PARROT.
- Instruction path co-processor [Zhou '00], etc.
Outline
- Introduction
- The x86vm Framework
- Efficient Dynamic Binary Translation
  - DBT Modeling & Translation Strategy
  - Efficient DBT for the x86 (SW)
  - Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
Co-Designed x86 Processor Architecture
(Diagram: x86 code in the memory hierarchy is mapped two ways — vertically by a hardware x86 decoder, and horizontally by VM translation/optimization software into a macro-op code cache; both paths feed the rename/dispatch stage, issue buffer, and pipeline execution back-end through the micro-/macro-op decoder and I-$.)
Co-designed virtual machine paradigm:
- Startup: simple hardware decode/crack for fast translation.
- Steady state: dynamic software translation/optimization for hotspots.
Macro-Op Pipeline for Efficient Hotspot Execution
(Figure: x86 pipeline — Fetch, Align, x86 Decode1-3, Rename, Dispatch, Wakeup, Select, Payload, RF, EXE, WB, Retire — vs. the macro-op pipeline — Fetch, Align/Fuse, Decode, Rename, Dispatch, pipelined 2-cycle issue logic, Payload, RF, EXE, WB, Retire.)
- Executes macro-ops; higher IPC and higher clock speed potential.
- Reduced instruction traffic throughout; reduced operand forwarding; pipelined scheduler.
- Shorter pipeline front-end.
Performance Simulation Configuration
(Table.)
Processor Evaluation: SPEC2K
(Figure.)
Processor Evaluation: WSB2004
(Figure.)
Performance Contributors
Many factors contribute to the IPC improvement:
- Code straightening.
- Macro-op fusing and execution.
- Shortened pipeline front-end (reduced branch penalty).
- Collapsed 3-1 ALUs (resolve branches & addresses sooner).
Besides the baseline and macro-op models, we model three intermediate configurations:
- M0: baseline + code cache.
- M1: M0 + macro-op fusing.
- M2: M1 + shorter pipeline front-end (macro-op mode).
- Macro-op: M2 + collapsed 3-1 ALUs.
Performance Contributors: SPEC2K
(Figure: IPC contributors.)
Performance Contributors: WSB2004
(Figure: IPC contributors.)
Outline
- Introduction
- The x86vm Framework
- Efficient Dynamic Binary Translation
  - DBT Modeling & Translation Strategy
  - Efficient DBT for the x86 (SW)
  - Hardware Assists for DBT (HW)
- A Co-Designed x86 Processor
- Conclusions
Conclusions
The co-designed virtual machine paradigm:
- Capability: binary compatibility; functionality integration; dynamic functionality upgrades.
- Performance: enables novel, efficient architectures; superior steady state, competitive startup.
- Complexity effectiveness: more flexibility for processor design, etc.
- Power efficiency.
Conclusions
The co-designed x86 processor:
- Capability: full x86 compatibility.
- Performance and power efficiency: the macro-op execution engine + DBT are an efficient combination.
- Complexity effectiveness: the VMM's DBT software removes considerable HW complexity; the microarchitecture can reduce pipeline width, with a simplified 2-cycle pipelined scheduler and a simplified operand forwarding network.
Finale: Questions & Answers
Suggestions and comments are welcome. Thank you!
Acknowledgements
- Prof. James E. Smith (advisor)
- Prof. Mikko H. Lipasti
- Dr. Ho-Seop Kim
- Dr. Ilhyun Kim
- Mr. Wooseok Chang
- Wisconsin Architecture Group
- Funding: NSF, IBM & Intel
- My family
Objectives for Computer Architects
Efficient designs for computer systems: more benefits, less cost. Specifically, three dimensions:
- Capability: practically all software can run.
- High performance: higher IPC, higher clock speeds.
- Simplicity / complexity-effectiveness: lower cost, more reliable; also low power consumption.
VM Software Modes
- VM mode (VM runtime): initial emulation of the architected ISA binary; DBT translation; exception handling.
- Translated native mode: hotspot code executed from the code cache, chained together.
Fuse Macro-ops: An Illustrative Example
(Figure.)
Why DBT Modeling?
- Understand translation overhead dynamics: profiling only gives samples.
- Translation strategy to reduce runtime overhead: HW/SW division and collaboration; translation stages (BBT, SBT) and trigger mechanisms.
- Hot threshold setting: too low, excessive false positives (translation time); too high, lost performance benefits (execution time).
Modeling via a Memory Hierarchy Perspective
(Figure.)
Analytic Modeling: DBT Overhead
(Figure: performance vs. translation complexity — x86 code is translated by BBT into the BBT code cache at cost MBBT × ΔBBT, and by SBT into the SBT code cache at cost MSBT × ΔSBT, relative to Ref: Superscalar.)
Analytic Modeling: Hot Threshold
(Figure: translation overhead vs. execution frequency — the break-even point, where performance speedup recoups translation overhead, determines the hot threshold.)
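The break-even intuition in the figure can be written down directly: hotspot translation pays off once a region's remaining executions save more cycles than the translation cost ΔSBT. The cycle costs below are hypothetical placeholders, not thesis measurements.

```python
# Break-even model for the hot threshold: SBT translation of a region is
# worthwhile once its execution count exceeds delta_sbt / per-execution gain.
# All numeric values here are illustrative assumptions.

def break_even_count(delta_sbt, cycles_per_exec_bbt, cycles_per_exec_sbt):
    """Executions after which SBT translation recoups its overhead."""
    gain = cycles_per_exec_bbt - cycles_per_exec_sbt  # cycles saved per run
    return delta_sbt / gain

# e.g. ~1000 cycles of SBT overhead per instruction; BBT-translated code at
# 1.0 cycles per execution vs. 0.7 for optimized hotspot code.
threshold = break_even_count(1000, 1.0, 0.7)
print(f"break-even after ~{threshold:.0f} executions")
```

This is why an excessively low hot threshold hurts: regions that never reach the break-even count pay the full ΔSBT without recouping it.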
Modeling Evaluation
(Figures.)
Model Implications
Translation overhead:
- Mainly caused by initial emulation.
- Hotspot translation overhead is not significant (and can always be beneficial) if no translation is performed for false-positive hotspots.
Hotspot detection:
- The model explains many empirical threshold settings in the literature.
- Excessively low thresholds lead to excessive false-positive hotspots, and thus excessive hotspot optimization overhead.
Efficient Dynamic Binary Translation
Efficient binary translation strategy:
- Adaptive runtime optimization.
- Efficient algorithms for translation & optimization.
- Cooperation with hardware accelerators.
Efficient translated native code:
- Generic optimizations: native register mapping & register allocation for long immediate values.
- Implementation ISA optimizations: enable new microarchitecture execution — macro-op fusing.
- Architected ISA optimizations: runtime x86-specific optimizations.
Register Mapping & Long Immediate (LIMM) Conversion
Objectives:
- Efficient emulation of the x86 ISA on the fusible ISA.
- 32-bit long immediate values embedded in x86 binaries are problematic for a 32-bit instruction set: remove them.
Solution — register mapping:
- Map all x86 GPRs to the lower 16 R registers in the fusible ISA.
- Map all x86 FP/multimedia registers to the lower 24 F registers in the fusible ISA.
Solution — long immediate conversion:
- Scan the superblock for all long immediate values.
- Perform value clustering analysis and allocate registers to frequent long immediate values.
- Convert some x86-embedded long immediate values into a register access, or a register plus a short immediate that the implementation ISA can handle.
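The conversion pass above can be sketched as follows. The function names, the short-immediate range, and the rewrite forms are assumptions for illustration; the real pass works on the superblock IR and its register allocator.

```python
# Sketch of long-immediate conversion: collect the 32-bit immediates in a
# superblock, give registers to the most frequent values, and rewrite the
# rest as register + short immediate when the delta fits. The short-imm
# range and rewrite encodings are assumed, not taken from the FISA spec.
from collections import Counter

SHORT_IMM_MAX = 1 << 10   # assumed short-immediate range of the fusible ISA

def allocate_limm_regs(immediates, num_regs):
    """Value clustering (here: simple frequency count) picks which long
    immediates get to live in registers."""
    return [v for v, _ in Counter(immediates).most_common(num_regs)]

def rewrite(imm, reg_values):
    """Return ('reg', v), ('reg+imm', base, delta), or ('inline', imm)."""
    if imm in reg_values:
        return ('reg', imm)
    for base in reg_values:
        if abs(imm - base) < SHORT_IMM_MAX:
            return ('reg+imm', base, imm - base)
    return ('inline', imm)   # must still be materialized the long way
```

Clustering matters because nearby addresses (e.g. fields of one global data structure) can all be reached from a single allocated base register.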
Code Reordering Algorithm
Problem: a new code scheduling algorithm is needed to group dependent instructions together — modern compilers schedule independent instructions apart, not dependent pairs together.
Idea: partition the instructions around a fused pair into PreSet, MidSet & PostSet:
- PreSet: all instructions that must be moved before the head.
- MidSet: all instructions in the middle, between the head and the tail.
- PostSet: all instructions that can be moved after the tail.
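A minimal sketch of this partition for a single (head, tail) pair, assuming dependence edges are given explicitly and run forward in program order. The real pass derives edges from the dependence graph, checks pair feasibility, and must also respect memory ordering, all of which this toy version omits.

```python
# Partition the instructions between a fused pair's head and tail:
# instructions the tail (transitively) depends on must move before the
# head (PreSet); the rest can move after the tail (PostSet), making the
# head and tail adjacent. deps must be forward edges (producer < consumer).

def partition(n, head, tail, deps):
    """deps: set of (producer, consumer) index pairs.
    Returns (pre_set, post_set) for the instructions strictly between
    head and tail."""
    # Transitive producers of each instruction, built in consumer order.
    producers = {i: set() for i in range(n)}
    for p, c in sorted(deps, key=lambda e: e[1]):
        producers[c] |= producers[p] | {p}
    pre, post = [], []
    for i in range(head + 1, tail):
        (pre if i in producers[tail] else post).append(i)
    return pre, post
```

On the six RISC-ops of the reordering example that follows, pairing the first ADD with the AND leaves the ST and LD between them free to move after the pair, which is exactly how the fused macro-op sequence is laid out.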
Code Reordering Example
In: RISC-ops with fusing info; out: macro-op sequence.
  1. ADD  R12, Redi, 1
  2. ST   R12, mem[R14]
  3. LD   Rebx, mem[Rebp + Recx]
  4. AND  Reax, R12, 007f
  5. ADD  R11, Reax, Resi
  6. LD   Redx, mem[R11 + 7c]
BBT Overhead Profile
- About 100 instructions of overhead per translated x86 instruction — comparable in cost to a main memory access.
- Distributed evenly among fetch/decode, semantics routines, the translation loop, and x86 cracking.
(Figure.)
Hotspot Translation Back-End Profile
- Light-weight optimizations (ProcLongImm, DDG setup, encoding): tens of instructions each.
- Heavy-weight optimizations (μ-op translation, fusing, code generation): none dominates.
- Overhead per x86 instruction is comparable to an initial load from disk.
(Figure.)
HW-Assisted DBT Overhead (100M)
(Figure.)
Breakeven Points for Individual Benchmarks
(Figure.)
Hotspots Detected vs. Runs
(Figure.)
Hotspot Coverage vs. Threshold
(Figure.)
Hotspot Coverage vs. Runs
(Figure.)
DBT Complexity/Overhead Trade-off
(Figure.)
Performance Evaluation: SPEC2000
(Figure: IPC trend.)
Co-Designed x86 Processor: Assessment
Architecture enhancement:
- The hardware/software co-designed paradigm enables novel designs & more desirable system features.
- Fusing dependent instruction pairs collapses the dataflow graph, reducing instruction management and inter-instruction communication.
Complexity effectiveness:
- Pipelined 2-cycle instruction scheduler.
- Significantly reduced ALU value forwarding network.
- DBT software removes a lot of hardware complexity.
Power consumption implications:
- Reduced pipeline width.
- Reduced inter-instruction communication and instruction management.
Processor Design Challenges
CISC challenges — suboptimal internal micro-ops:
- Complex decoders & obsolete features/instructions.
- Instruction count expansion: management, communication.
- Redundancy & inefficiency in the cracked micro-ops.
- Solution: dynamic optimization.
Other current challenges (CISC & RISC):
- Efficiency: nowadays, less performance gain per transistor.
- Power consumption has become acute.
- Solution: novel, efficient microarchitectures.
Superscalar x86 Pipeline Challenges
(Pipeline: Fetch, Align, x86 Decode1-3, Rename, Dispatch, atomic scheduler (wake-up & select), RF, EXE, WB, Retire.)
These are the best-performing x86 processors, BUT several pipeline stages are critical:
- Branch behavior limits fetch bandwidth.
- Complex x86 decoders sit at the pipeline front-end.
- Complex issue logic: wake-up and select in the same cycle.
- Complex operand forwarding networks: wire delays.
- Instruction count expansion: high pressure on instruction management and communication mechanisms.
Related Work: x86 Processors
- AMD K7/K8 microarchitecture: macro-operations; high-performance, efficient pipeline.
- Intel Pentium M: micro-op fusion; stack manager; high performance, low power.
- Transmeta x86 processors: co-designed x86 VM; VLIW engine + code morphing software.
Future Directions
Co-designed virtual machine technology:
- Confidence: more realistic, exhaustive benchmark studies — important for whole-workload behavior.
- Enhancement: more synergetic, complexity-effective HW/SW co-design techniques.
- Application: specific enabling techniques for specific novel computer architectures of the future.
Example co-designed x86 processor design:
- Confidence study as above.
- Enhancement (HW μ-arch): reduce register write ports.
- Enhancement (VMM): more dynamic optimizations in SBT, e.g. CSE, a software stack manager, SIMDification.