1
Emulation: Interpretation and Binary Translation
This material is based on the book, Virtual Machines: Versatile Platforms for Systems and Processes, Copyright 2005 by Elsevier Inc. All rights reserved. It has been used and/or modified with permission from Elsevier Inc.
2
Key VM Technologies
Emulation: a binary in one ISA is executed on a processor supporting a different ISA
Dynamic optimization: a binary is improved for higher performance
  May be done as part of emulation
  May optimize within the same ISA (no emulation needed)
(Figure: x86/Windows apps emulated on an Alpha host, as in FX!32; HP-UX apps on the HP PA ISA optimized in place, as in Dynamo)
3
Definitions
Note: there are no standard definitions...
Emulation: a method for enabling a (sub)system to present the same interface and characteristics as another
  E.g. the execution of programs compiled for instruction set A on a machine that executes instruction set B
Simulation: a method for modeling a (sub)system's operation
  The objective is usually to study the process, not just to imitate the function
  Typically emulation is part of the simulation process
Ways of implementing emulation:
  Interpretation: relatively inefficient, one instruction at a time
  Binary translation: a block at a time, optimized for repeated instruction executions
4
Definitions
Guest: the environment that is being supported by the underlying platform
Host: the underlying platform that provides the guest environment
(Figure: guest supported by host)
5
Definitions
Source ISA or binary: the original instruction set or binary, i.e. the instruction set to be emulated
Target ISA or binary: the instruction set executed by the processor performing the emulation, i.e. the underlying instruction set, or the binary that is actually executed
Terminology is sometimes confusing, e.g. Shade uses "target" where we say host
Source/target refer to ISAs; guest/host refer to platforms
(Figure: source emulated by target)
6
Big Picture
In practice: staged operation
  Interpretation
  Binary translation and caching
  Use profiling to determine which code to translate
(Figure: an emulation manager coordinating an interpreter, a translator/optimizer, profile data, the binary memory image, and a code cache)
7
Interpreters
HLL interpreters have a very long history
  Lisp
  Perl
  Forth (notable for its threaded interpretation model)
Binary interpreters use many of the same techniques
  Often simplified
  Some performance tradeoffs are different, e.g. the significance of using an intermediate form
8
Interpreter State
Hold the complete source state in the interpreter's data memory
(Figure: source code, data, and stack held in memory, plus a context block containing the program counter, condition codes, and registers 0..n-1, all managed by the interpreter code)
9
Decode-Dispatch Interpretation
    while (!halt && !interrupt) {
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        switch (opcode) {
        case LoadWordAndZero: LoadWordAndZero(inst); break;
        case ALU:             ALU(inst);             break;
        case Branch:          Branch(inst);          break;
        /* ... one case per opcode, dispatching to its instruction function */
        }
    }
10
Instruction Functions: Load
    LoadWordAndZero(inst) {
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        displacement = extract(inst, 15, 16);
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        PC = PC + 4;
    }
11
Instruction Functions: ALU
    ALU(inst) {
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        RB = extract(inst, 15, 5);
        source1 = regs[RA];
        source2 = regs[RB];
        extended_opcode = extract(inst, 10, 10);
        switch (extended_opcode) {
        case Add:         Add(inst);         break;
        case AddCarrying: AddCarrying(inst); break;
        case AddExtended: AddExtended(inst); break;
        /* ... */
        }
        PC = PC + 4;
    }
12
Decode-Dispatch: Efficiency
The decode-dispatch loop is mostly serial code:
  Case statement (hard-to-predict indirect jump)
  Call to the function routine
  Return
Executing an add instruction takes approximately 20 target instructions
  Several loads/stores
  Several shift/mask steps
Hand coding can lead to better performance
  Example: DEC/Compaq FX!32
13
Threaded Interpretation
Remove the decode-dispatch jump inefficiency by getting rid of the decode-dispatch loop
  Copy the decode-dispatch code into each function routine
  The source code "threads" together the function routines
14
Threaded Interpretation Example
    LoadWordAndZero:
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        displacement = extract(inst, 15, 16);
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        PC = PC + 4;
        if (halt || interrupt) goto exit;
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        extended_opcode = extract(inst, 10, 10);
        routine = dispatch[opcode, extended_opcode];
        goto *routine;
15
Threaded Interpretation Example
    Add:
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        RB = extract(inst, 15, 5);
        source1 = regs[RA];
        source2 = regs[RB];
        sum = source1 + source2;
        regs[RT] = sum;
        PC = PC + 4;
        if (halt || interrupt) goto exit;
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        extended_opcode = extract(inst, 10, 10);
        routine = dispatch[opcode, extended_opcode];
        goto *routine;
16
Comparison
(Figure: in a decode-dispatch interpreter, a central dispatch loop makes "data" accesses to the source code and calls the interpreter routines; in a threaded interpreter, each routine dispatches directly to the next, with no central loop)
17
Optimization: Pre-Decoding
Scan the static instructions and convert them to an internal pre-decoded form
  Separate the opcode, operands, etc.
  Perhaps put into an intermediate form (stack form, RISC-like micro-ops)
  Reduces shifts/masks significantly
Example source fragment (see the sketch below):
    lwz  r1, 8(r2)
    add  r3, r3, r1
    stw  r3, 0(r4)
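As a concrete illustration of the pre-decoded form, here is a minimal C sketch; the predecoded_t layout, its field names, and the predecode_lwz helper are assumptions for illustration, not the slides' code.

    /* One decode pass up front replaces the per-execution shift/mask work
     * done inside the interpreter loop. */
    #include <stdint.h>

    typedef struct {
        uint8_t op;          /* internal opcode: index into the handler table     */
        uint8_t dest;        /* destination register, already extracted           */
        uint8_t src1;        /* source register                                   */
        int32_t src2;        /* second source register or sign-extended immediate */
    } predecoded_t;

    /* Illustrative pre-decode of one PowerPC-style "lwz RT, d(RA)" word,
     * using the same bit fields as the extract() calls above. */
    static predecoded_t predecode_lwz(uint32_t inst) {
        predecoded_t p;
        p.op   = 1;                              /* internal LoadWordAndZero code */
        p.dest = (inst >> 21) & 0x1F;            /* RT field                      */
        p.src1 = (inst >> 16) & 0x1F;            /* RA field                      */
        p.src2 = (int16_t)(inst & 0xFFFF);       /* displacement, sign-extended   */
        return p;
    }

The interpreter then reads these fields directly (as the direct threaded example later does with code[TPC].dest and friends) instead of re-extracting them on every execution.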
18
Direct Threaded Interpretation
Eliminate the indirection through the dispatch table
  Replace opcodes with the PCs of the interpreter routines
  Becomes dependent on the locations of the interpreter routines
  Limits portability of the interpreter
19
Direct Threaded Interpretation
    Load Word and Zero:
        RT = code[TPC].dest;
        RA = code[TPC].src1;
        displacement = code[TPC].src2;
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        SPC = SPC + 4;
        TPC = TPC + 1;
        if (halt || interrupt) goto exit;
        routine = code[TPC].op;
        goto *routine;
20
Direct Threaded Interpretation
(Figure: a pre-decoder converts the source code into intermediate code, which the interpreter routines then execute directly)
21
Interpreting x86
CISC ISA instruction format: Prefixes | Opcode | ModR/M | SIB | Displacement | Immediate
  Prefixes: 0 to 4 optional bytes; ModR/M and SIB are optional; Displacement and Immediate are each 0, 1, 2, or 4 bytes
22
Interpreting x86 (continued)
Maintain a general-purpose instruction template
  Parse the instruction and fill in the template
  Then dispatch and execute
(Figure: a general decode step fills in the instruction structure, then dispatch selects one of n specialized routines, one per instruction)
23
x86 Instruction Template (after Bochs)
    struct IA32_instr {
        unsigned short opcode;
        unsigned short prefixmask;
        char ilen;                          // instruction length
        InterpreterFunctionPointer execute; // semantic routine
        struct {                            // general address computation: [Rbase + (Rindex << shmt) + displacement]
            char mode;                      // addressing mode, including register operand
            char Rbase;                     // base address register
            char Rindex;                    // index register
            char shmt;                      // index scale factor
            long displacement;
        } operandRM;
        struct {
            char mode;                      // either register or immediate
            char regname;                   // register number
            long immediate;                 // immediate value
        } operandRI;
    } instr;
24
Add Instruction
    // ADD: register + Reg/Mem --> register
    ADD_RX_32b(IA32_instr instr)    // X means: either Reg or Mem
    {
        unsigned op1_32, op2_32, sum_32;
        op1_32 = IA32_GPR[instr.operandRI.regname];
        if (mem_operand(instr.operandRM.mode)) {
            unsigned mem_addr = resolve_mem_address(instr);
            op2_32 = virtual_mem(instr.segment, mem_addr);
        } else {
            op2_32 = IA32_GPR[instr.operandRM.Rbase];
        }
        sum_32 = op1_32 + op2_32;
        IA32_GPR[instr.operandRI.regname] = sum_32;
        SET_IA32_CC_FLAGS(op1_32, op2_32, sum_32, IA32_INSTR_ADD32);
    }
25
Faster x86 Interpreter
Make the common cases fast
(Figure: dispatch on the first byte; simple instructions 1..m and complex instructions m+1..n each get a specialized routine, with shared routines for things like prefix handling and flag setting)
26
Threaded x86 Interpreter
Thread the common instructions; decode/dispatch the uncommon ones
(Figure: each simple instruction routine ends with a simple decode/dispatch to the next routine; complex instructions fall back to a full complex decode/dispatch step)
27
FX!32 Interpreter: Inner Loop
Software pipelining overlaps the dispatch-table lookup with execution of the previous instruction
  Note the pressure this puts on the D-cache while interpreting
Uses a 64K dispatch table indexed by opcode
Keeps the next 8 bytes of x86 code in registers: instruction I plus part of instruction I+1
Loop (see the sketch below):
  Use length(inst I) to access the first two bytes of inst I+1
  Table lookup with the first two bytes of I+1: get the pointer to the interpreter routine and length(inst I+1)
  Initiate the fetch of additional instruction bytes for I+1 and the beginning of I+2
  Dispatch to execute inst I
  Repeat
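The overlap can be sketched in C as below. This is only an illustration of the software-pipelining idea (FX!32 itself is hand-coded Alpha assembly); the dispatch_entry table, its indexing by the first two instruction bytes, and the helper names are assumptions, and a real interpreter would also check for halt/interrupt and handle instructions whose length cannot be determined from two bytes.

    /* Sketch: while instruction I executes, the table lookup for I+1 has
     * already been issued.  Table contents are assumed to be filled in at
     * startup. */
    #include <stdint.h>
    #include <stddef.h>

    typedef void (*interp_fn)(const uint8_t *inst);

    struct dispatch_entry {
        interp_fn routine;   /* specialized interpreter routine        */
        uint8_t   length;    /* length of the instruction              */
    };

    static struct dispatch_entry dispatch[1 << 16];   /* 64K table, indexed by first two bytes */

    static void interp_loop(const uint8_t *code, size_t pc) {
        const uint8_t *inst = &code[pc];
        struct dispatch_entry next = dispatch[(inst[0] << 8) | inst[1]];
        for (;;) {                                               /* real code also checks halt/interrupt */
            struct dispatch_entry cur = next;
            const uint8_t *next_inst = inst + cur.length;        /* use length(I) to reach I+1      */
            next = dispatch[(next_inst[0] << 8) | next_inst[1]]; /* lookup for I+1 ...              */
            cur.routine(inst);                                   /* ... overlaps execution of I     */
            inst = next_inst;
        }
    }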
28
Binary Translation
Generate "custom" code for every source instruction
  Get rid of instruction "parsing" and dispatch jumps altogether
29
Binary Translation Example
x86 source binary:
    addl %edx,4(%eax)
    movl 4(%eax),%edx
    add  %eax,4
Translate to a PowerPC target, where:
  r1 points to the x86 register context block
  r2 points to the x86 memory image
  r3 contains the x86 ISA PC value
30
Binary Translation Example
    lwz   r4,0(r1)    ; load %eax from register block
    addi  r5,r4,4     ; add 4 to %eax
    lwzx  r5,r2,r5    ; load operand from memory
    lwz   r4,12(r1)   ; load %edx from register block
    add   r5,r4,r5    ; perform add
    stw   r5,12(r1)   ; put result into %edx
    addi  r3,r3,3     ; update PC (3 bytes)
    stwx  r4,r2,r5    ; store %edx value into memory
    addi  r4,r4,4     ; add immediate
    stw   r4,0(r1)    ; place result back into %eax
31
Binary Translation
(Figure: a binary translator converts the source code binary into a binary of translated target code)
32
Optimization: Register Mapping
Map source registers to target registers rather than always going through the register context block
  Keep a copy of the register state in target memory; spill registers to it when needed by the emulator
  Easier if #target registers > #source registers
  If there are not enough target registers, register mapping may be done on a per-block basis
  Reduces loads/stores significantly
33
Register Mapping Example
  r1 points to the x86 register context block
  r2 points to the x86 memory image
  r3 contains the x86 ISA PC value
  r4 holds x86 register %eax
  r7 holds x86 register %edx, etc.

    addi  r16,r4,4    ; add 4 to %eax
    lwzx  r17,r2,r16  ; load operand from memory
    add   r7,r17,r7   ; perform add of %edx
    stwx  r7,r2,r16   ; store %edx value into memory
    addi  r4,r4,4     ; increment %eax
    addi  r3,r3,9     ; update PC (9 bytes)
34
The Code Discovery Problem
In order to translate, the emulator must be able to "discover" code
  Easier said than done
Consider x86 code: the same byte stream can be decoded differently depending on where decoding starts
(Figure: a byte sequence beginning 31 c0 8b ... bd ... can be read as "mov %ch,0" followed by something, or as part of a longer "movl %esi, 0x...(%ebp)"; the instruction boundaries are ambiguous)
35
The Code Discovery Problem
There are a number of contributors to the code discovery problem:
  Data interspersed in the instruction stream
  Padding inserted for instruction alignment
  Register (indirect) jumps whose targets are unknown ("jump indirect to ???")
(Figure: a source instruction stream mixing instructions, embedded data, alignment pads, an unconditional branch, and an indirect jump with an unknown target)
36
The Code Location Problem
Source and target binaries use different PCs
  On an indirect jump, the PC value is a source PC, but the jump should go to a target PC
x86 source code:
    movl %eax, 4(%esp)   ; load jump address from memory
    jmp  %eax            ; jump indirect through %eax
PowerPC target code:
    addi  r16,r11,4      ; compute x86 address
    lwzx  r4,r2,r16      ; get x86 jump address from x86 memory image
    mtctr r4             ; move to count register
    bctr                 ; jump indirect through ctr
37
Simplified Special Cases
Fixed mapping
  All instructions have fixed length and lie on known boundaries, like many RISCs
  Data and pads may get translated, but that's OK: data will still come from the original source locations (more later)
Designed in
  No jumps or branches to arbitrary locations; all jumps are via procedure (method) calls
  No data or pads mixed with instructions
  Java has these properties: all code in a method can be "discovered" when the method is entered
38
Dynamic Translation
First interpret
  And perform code discovery as a byproduct
Translate code incrementally, as it is discovered
  Place translated blocks into code memory
  Save source-to-target PC mappings in a lookup table
Emulation process (see the sketch below):
  Execute the translated block to its end
  Look up the next source PC in the table
  If translated, jump to the corresponding target PC
  Else interpret and translate
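A minimal C sketch of this emulation process, assuming hypothetical helpers (map_lookup, map_insert, interpret_and_translate_block) for the SPC-to-TPC table, the interpreter, and the translator; it only captures the control flow described above.

    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t (*translated_block)(void);   /* runs to block end, returns next SPC */

    /* Assumed helpers (not defined here): */
    translated_block map_lookup(uint64_t spc);                    /* SPC -> TPC lookup table      */
    void             map_insert(uint64_t spc, translated_block tpc);
    translated_block interpret_and_translate_block(uint64_t spc); /* code discovery + translation */

    void emulate(uint64_t spc) {
        for (;;) {
            translated_block tb = map_lookup(spc);
            if (tb == NULL) {                            /* not yet discovered              */
                tb = interpret_and_translate_block(spc); /* interpret, translate the block  */
                map_insert(spc, tb);                     /* save the SPC -> TPC mapping     */
            }
            spc = tb();                                  /* execute translated block to end */
        }
    }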
39
Dynamic Translation
(Figure: the Emulation Manager consults an SPC-to-TPC lookup table; on a hit it jumps into translation memory, on a miss it invokes the interpreter and translator on the source binary)
40
Dynamic Blocks
Initially, the translation units are dynamic basic blocks
  A dynamic basic block starts at the actual branch target that execution reaches and runs until the next executed branch, whereas static basic blocks are bounded by all potential branch targets and branches
Later, larger blocks containing branches can be formed
(Figure: the same code partitioned into static basic blocks versus dynamic basic blocks; dynamic blocks follow the executed path, so the same static instructions, e.g. those after "skip:", can appear in more than one dynamic block)
41
Flow of Control
Control flows back and forth between translated blocks and the Emulation Manager
(Figure: execution passes from the EM into a translation block, back to the EM, on to the next translation block, and so on)
42
Tracking the Source PC
The SPC can always be updated as part of the translated code, but it is better to place the SPC in a stub after the block
General method:
  The translated block returns to the EM via a branch-and-link (BL)
  The source PC of the next block is placed in a stub immediately after the BL
  The EM uses the link register to find that source PC, then hashes it to locate the next target code block
(Figure: a code block ends with a branch-and-link to the EM followed by the next source PC; the EM maps it through a hash table)
43
Emulation Manager Flowchart
(Figure: Emulation Manager flowchart)
44
Example
x86 binary (the PowerPC translation is shown on the next slide):
    4FD0: addl %edx,(%eax)   ; load and accumulate sum
          movl (%eax),%edx   ; store to memory
          sub  %ebx,1        ; decrement loop count
          jz   51C8          ; branch if at loop end
    4FDC: add  %eax,4        ; increment %eax
          jmp  4FD0          ; jump to loop top
    51C8: movl (%ecx),%edx   ; store last value of %edx
          xorl %edx,%edx     ; clear %edx
          jmp  6200          ; jump elsewhere
45
Example (continued)
Hash table: SPC 51C8 maps to TPC 9C08
PowerPC translation:
    9AC0: lwz    r16,0(r4)    ; load value from memory
          add    r7,r7,r16    ; accumulate sum
          stw    r7,0(r5)     ; store to memory
          addic. r5,r5,-1     ; decrement loop count, set cr0
          beq    cr0,pc+12    ; branch if loop exit
          bl     F000         ; branch & link to EM
          4FDC                ; save source PC in link-addressed stub
    9AE4: bl     F000         ; branch & link to EM
          51C8                ; save source PC in link-addressed stub
    9C08: stw    r7,0(r6)     ; store last value of %edx
          xor    r7,r7,r7     ; clear %edx
          bl     F000         ; branch & link to EM
          6200                ; save source PC in link-addressed stub
Steps (following the numbered arrows in the figure):
  1. The translated basic block is executed
  2. The branch is taken to the stub
  3. The stub does a BAL to the Emulation Manager
  4. The EM loads the SPC from the stub, using the link register
  5. The EM hashes the SPC and does the lookup
  6. The EM loads the SPC from the hash table and compares
  7. The TPC is loaded from the hash table
  8. Branch to the transfer code
  9. Jump indirect to the next translated block
 10. Continue execution
46
Translation Chaining
Jump from one translation directly to the next
  Avoids switching back to the EM
(Figure: without chaining, every block exits through the VMM/EM before the next block runs; with chaining, blocks branch directly to one another)
47
Creating a Link
(Figure: the predecessor block ends with a JAL to the translation manager followed by the next SPC; the manager (1) gets the next SPC, (2) looks up the successor, and (3) sets up the chain by replacing that exit with a direct jump to the successor's TPC)
48
Example
(Figure: translated blocks before and after chaining)
49
Translation Chaining
Chaining depends on the translation-terminating branch:
  Unconditional branch: jump directly to the next translation
  Conditional branch: link each exit separately
  Indirect jumps:
    Simple solution: always return to the EM
    Dynamic "chaining": software jump prediction
(Figure: translation blocks chained around the VMM)
(A sketch of the chain-patching bookkeeping follows.)
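Here is a minimal sketch of the chaining bookkeeping, assuming a hypothetical exit_stub record and tpc_lookup/patch_branch helpers; it shows only the decision, not the target-specific instruction patching.

    /* When translating (or re-visiting) a block exit, see whether the
     * successor already has a translation and, if so, patch the exit so it
     * jumps straight there instead of returning to the emulation manager. */
    #include <stdint.h>
    #include <stddef.h>

    struct exit_stub {
        uint64_t next_spc;     /* source PC this exit leads to                         */
        void    *patch_site;   /* address of the branch that currently exits to the EM */
    };

    /* Assumed helpers (not defined here): */
    void *tpc_lookup(uint64_t spc);                 /* SPC -> TPC lookup table   */
    void  patch_branch(void *site, void *target);   /* overwrite branch at site  */

    void try_to_chain(struct exit_stub *e) {
        void *tpc = tpc_lookup(e->next_spc);
        if (tpc != NULL)
            patch_branch(e->patch_site, tpc);   /* next time, jump directly to the successor */
        /* otherwise the exit keeps branching back to the emulation manager */
    }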
50
Software Jump Prediction
A form of "inline caching"
Example code, where Rx holds the source branch address, addr_i are predicted addresses (in probability order, determined via profiling), and target_i are the corresponding target code blocks:
    If Rx == addr_1 goto target_1
    Else if Rx == addr_2 goto target_2
    Else if Rx == addr_3 goto target_3
    Else hash_lookup(Rx)    ; do it the slow way
51
Call/Returns
Calls and returns use the source stack for PC save/restore
  Stack contents can be changed between the call and the return
This causes a problem: the return PC on the stack is a source PC, but a target PC is needed
  Do a hash table lookup on every return?
Solution: use a shadow stack
52
Shadow Stack
Integrate into the target stack and maintain it with translated code (see the sketch below)
Calling routine:
  Allocates a shadow stack frame
  Saves the source return address and source stack pointer
Called routine:
  Saves the target return address in the caller's shadow stack frame
On return:
  The called routine checks the source return address and stack pointer against the saved versions in the shadow frame
  If unchanged, use the target return address
  Else return to the EM and apply the source return address to the hash table
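A minimal C sketch of the shadow-stack check on return, with made-up field and helper names; the slides describe the mechanism, not this particular code.

    /* A shadow-stack frame records the source return address and stack
     * pointer at call time, plus the translated-code return address filled
     * in by the callee.  On return, the saved source values are compared
     * with the current ones; only if they still match can the target return
     * address be trusted. */
    #include <stdint.h>

    struct shadow_frame {
        uint64_t source_ra;   /* source return address saved by the caller  */
        uint64_t source_sp;   /* source stack pointer saved by the caller   */
        void    *target_ra;   /* target return address saved by the callee  */
    };

    /* Assumed helper: EM hash-table lookup (or translation) keyed by source PC. */
    void *em_lookup_or_translate(uint64_t spc);

    void *shadow_return(const struct shadow_frame *f,
                        uint64_t cur_source_ra, uint64_t cur_source_sp) {
        if (f->source_ra == cur_source_ra && f->source_sp == cur_source_sp)
            return f->target_ra;                       /* fast path: stack unchanged */
        return em_lookup_or_translate(cur_source_ra);  /* slow path: back to the EM  */
    }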
53
Shadow Stack Example
(Figure: on a call, the x86 return address and x86 stack pointer are copied from the x86 stack into the shadow stack frame, along with the PowerPC return address; on return, the saved values are compared against the current x86 stack, and the PPC return address is used only if the comparison succeeds)
54
Other Important Issues
To be covered later:
  Self-modifying code
  Self-referencing code
  Precise traps
55
Source/Target ISA Issues
Register architectures
Condition codes
Data formats and operations: floating point, decimal, MMX
Address resolution: byte vs. word addressing
Address alignment: natural vs. arbitrary
Byte order: big vs. little endian
56
Registers
Local allocation
  Assign target registers to source registers for one translation block
  Best case: each target register is read/written only once per translation
Global allocation
  Assign some target registers permanently, e.g. for the PC and stack pointer
Regional allocation
  Register assignment is fixed for a region of translations
  Check to see which assignment is in effect; write registers back when the region changes
If #source registers << #target registers, assign all registers permanently; the converse means many saves/restores
57
Condition Codes
Both source and target have CCs
  Target CCs may need to be saved/restored frequently, e.g. if they are set implicitly
  If the CC semantics differ, it may be simpler to save the input values and the operation, then generate CCs only when actually needed
Source has no CCs; target does
  Target emulates compare/branch directly
  Some comparisons may require multiple compares/branches
Source has CCs; target does not
  Target must emulate the CCs each time they are needed
  Or save the input values and perform a lazy compare/branch when needed
58
Lazy Condition Code Evaluation
x86 source:
    add %ecx,%ebx
    jmp label1
    . . .
label1:
    jz  target
PowerPC-to-x86 register mappings:
  r4  = eax
  r5  = ebx
  r6  = ecx
  r24 = scratch register used by emulation code
  r25 = condition code operand 1   ; registers used for
  r26 = condition code operand 2   ; lazy condition
  r27 = condition code operation   ; code emulation
  r28 = jump table base address
59
Lazy Condition Code Evaluation
    mr    r25,r6         ; save operands
    mr    r26,r5         ;   and opcode for
    li    r27,"add"      ;   lazy condition code emulation
    add   r6,r6,r5       ; translation of add
    b     label1
    ...
label1:
    bl    genZF          ; branch and link to genZF code
    beq   cr0,target     ; branch on condition flag

genZF:
    add   r29,r28,r27    ; add "opcode" to jump table base
    mtctr r29            ; copy to counter register
    bctr                 ; branch via jump table

"add":
    add.  r24,r25,r26    ; perform PowerPC add, set cr0
    blr                  ; return
60
Data Formats
Different floating point formats
  Convert source values to target values
  Re-convert prior to stores and at translation boundaries
  Results will not be identical; for identical results, detailed emulation is required (very expensive)
Same floating point formats
  No conversion problem
  IEEE compliance means no problem with results, either
Multimedia (MMX)
  Usually simple data types
  Per-instruction pre-translated libraries may be the best solution
61
Address Resolution
Byte addressing vs. word addressing
  Not many word-addressed machines left
Bytes on a word-addressed machine: lots of shifting and masking
Words on a byte-addressed machine: low-order address bits are kept at zero
There may also be address space problems if the byte-addressed machine has a smaller total address space
62
Address Alignment
Some ISAs allow addresses aligned on arbitrary byte boundaries ("unaligned"); others allow only "natural" alignment
An unaligned source on a naturally aligned target has high overhead
  Multiple loads/stores, shifts, and logical operations
  Test alignment first, and only perform the multiple operations if the address is actually unaligned (see the sketch below)
  Can generate two translation bodies, check alignment first, then branch to the proper body
Many aligned machines already have provisions for unaligned data
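A small C sketch of the "test alignment first" approach for a 32-bit load, assuming a little-endian guest and a byte-array view of guest memory; the helper and variable names are illustrative.

    /* Fast path for naturally aligned addresses; slow byte-by-byte assembly
     * only when the address is actually unaligned. */
    #include <stdint.h>
    #include <string.h>

    static uint32_t load32_guest(const uint8_t *mem, uint64_t addr) {
        if ((addr & 0x3) == 0) {                  /* naturally aligned: fast path */
            uint32_t v;
            memcpy(&v, mem + addr, sizeof v);     /* single aligned access        */
            return v;
        }
        /* unaligned: multiple loads plus shifts/ORs (little-endian guest assumed) */
        return  (uint32_t)mem[addr]
              | ((uint32_t)mem[addr + 1] << 8)
              | ((uint32_t)mem[addr + 2] << 16)
              | ((uint32_t)mem[addr + 3] << 24);
    }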
63
Byte Order
If guest and host byte orders differ, the target code must perform byte reordering
  Can be very inefficient: 8-10 target instructions per host load (see the sketch below)
  Some target ISAs support both byte orders, e.g. MIPS and IA-64
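For illustration, here is the kind of byte reordering a translated load ends up doing when guest and host endianness differ; the shift/mask sequence is where the extra instructions per load come from. This is a generic sketch, not code from the slides.

    /* Byte-swap a 32-bit value: roughly the work every emulated multi-byte
     * load/store must add when guest and host endianness differ. */
    #include <stdint.h>

    static inline uint32_t bswap32(uint32_t x) {
        return  (x >> 24)
              | ((x >> 8)  & 0x0000FF00u)
              | ((x << 8)  & 0x00FF0000u)
              |  (x << 24);
    }

    /* An emulated guest load of a word from (host-resident) guest memory. */
    static uint32_t guest_load32(const uint32_t *host_addr) {
        return bswap32(*host_addr);
    }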
64
Case Study: Sun Shade
A fast instruction set simulator
  A good case study for understanding the important issues
Uses binary translation for speed
  Uncommon in simulators; typically interpretation is used
The paper contains an interesting discussion of simulation in general; we focus on the binary translation aspect
65
Shade Code Structures
TC: Translation Cache
TLB: Translation Lookaside Buffer (a hash table)
vpc: the virtual (source) PC, held in a register
(Figure: the simulated virtual memory (VMEM) with text, data, and stack, plus the TLB and the TC)
66
Shade Translations
Translation units
  Basic block or superblock
  Some source code fragments may appear in more than one translation block
Translation stops at special instructions
  Synchronization, indirect branches, software traps
Limit on translation block size
  To simplify storage allocation
Translation cache size: 4MB
67
Shade TLB (Hash Table)
Maps a source PC to its cached translation
  (An unfortunate, confusing name)
An array of lists (i.e. n-way set associative)
  Each list holds n <source, target> pairs
TLB size: 256KB (32K entries)
68
Shade Main Simulation Loop
Hash vpc (the source PC) into a TLB array index (see the sketch below)
Fast partial search of the TLB entry for the vpc value
  If hit, use the target translation
  If miss, call a function to do a full search of the TLB entries
    If hit, move the matching entry to the front of the hash list
    If miss, generate a translation and place it at the front of the list
Finding the successor block
  The code block places the next source PC in vpc, then jumps to the VMM
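A C sketch of this lookup, assuming hypothetical tlb_entry/tlb_set structures and hash/translate helpers; for brevity it folds the fast partial search and the full search into one pass, which the real Shade code separates.

    /* Hash the source PC into the "TLB", search the n-way list, move a hit
     * to the front, translate on a miss. */
    #include <stdint.h>
    #include <stddef.h>

    #define WAYS 4                                /* n <source,target> pairs per list */

    struct tlb_entry { uint64_t spc; void *tpc; };
    struct tlb_set   { struct tlb_entry e[WAYS]; };

    static struct tlb_set tlb[(1 << 15) / WAYS];  /* 32K <spc,tpc> entries total, as on the slide */

    /* Assumed helpers (not defined here): */
    size_t hash(uint64_t vpc);                    /* vpc -> TLB array index     */
    void  *translate(uint64_t vpc);               /* generate a new translation */

    static void *lookup_or_translate(uint64_t vpc) {
        struct tlb_set *set = &tlb[hash(vpc)];
        for (int i = 0; i < WAYS; i++) {
            if (set->e[i].spc == vpc) {
                struct tlb_entry hit = set->e[i];          /* found: move to front */
                for (int j = i; j > 0; j--) set->e[j] = set->e[j - 1];
                set->e[0] = hit;
                return hit.tpc;
            }
        }
        struct tlb_entry fresh = { vpc, translate(vpc) };  /* miss: translate      */
        for (int j = WAYS - 1; j > 0; j--) set->e[j] = set->e[j - 1];
        set->e[0] = fresh;                                 /* place at front       */
        return fresh.tpc;
    }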
69
Shade Chaining Setup
Successor translated first:
  The predecessor is translated with the branch to the successor (the chain pointer) built in
  The translator can use the hash table to find the chain pointer
Predecessor translated first:
  The block finishes with a BAL to the VMM; the link register then contains the PC of the BAL instruction (the slot where a chaining branch should go)
  When the successor is translated, the translator overwrites the BAL with a branch to the successor
Indirect jump chaining:
  Not used; always return to the VMM
(Figure: the predecessor ends with "BAL VMM / next src PC"; after the successor is translated, the BAL is replaced with "B trgt PC" to set up the chain)
70
Emulation Summary: Decode/Dispatch Interpretation
  Memory requirements: low
  Startup: fast
  Steady-state performance: slow
  Portability: excellent
71
Emulation Summary: Indirect Threaded Interpretation
  Memory requirements: low (slightly more than decode/dispatch due to replicated dispatch code)
  Startup: fast
  Steady-state performance: slow
  Portability: excellent
72
Emulation Summary: Direct Threaded Interpretation
  Memory requirements: high
  Startup: slow
  Steady-state performance: medium
  Portability: medium (the pre-decoded version is location-specific)
73
Emulation Summary: Binary Translation
  Memory requirements: high
  Startup: very slow
  Steady-state performance: fast
  Portability: poor