Emulation: Interpretation and Binary Translation


Emulation: Interpretation and Binary Translation This material is based on the book, Virtual Machines: Versatile Platforms for Systems and Processes, Copyright 2005 by Elsevier Inc. All rights reserved. It has been used and/or modified with permission from Elsevier Inc.

Key VM Technologies
Emulation: a binary in one ISA is executed on a processor supporting a different ISA.
Dynamic optimization: a binary is improved for higher performance.
- May be done as part of emulation
- May optimize within the same ISA (no emulation needed)
[Figures: FX!32 — x86 Windows apps emulated on Alpha; Dynamo — HP-UX apps on HP PA, optimized for the same ISA]

Definitions (note: there are no standard definitions)
Emulation: a method for enabling a (sub)system to present the same interface and characteristics as another, e.g. the execution of programs compiled for instruction set A on a machine that executes instruction set B.
Simulation: a method for modeling a (sub)system's operation. The objective is usually to study the process, not just to imitate the function; typically emulation is part of the simulation process.
Ways of implementing emulation:
- Interpretation: relatively inefficient, instruction-at-a-time
- Binary translation: block-at-a-time, optimized for repeated instruction executions

Definitions
Guest: the environment that is being supported by the underlying platform.
Host: the underlying platform that provides the guest environment.
[Figure: guest supported by host]

Definitions
Source ISA or binary: the original instruction set or binary, i.e. the instruction set to be emulated.
Target ISA or binary: the instruction set being executed by the processor performing the emulation, i.e. the underlying instruction set, or the binary that is actually executed.
Terminology is sometimes confusing; e.g. Shade uses "target" for what is here called the host.
Source/target refer to ISAs; guest/host refer to platforms.
[Figure: source emulated by target]

Big Picture
In practice, emulation is a staged operation:
- Interpretation
- Binary translation and caching
- Use profiling to determine which
[Figure: emulation manager coordinating the interpreter and translator/optimizer, with the binary memory image, code cache, and profile data]

Interpreters
HLL interpreters have a very long history: Lisp, Perl, Forth (notable for its threaded interpretation model).
Binary interpreters use many of the same techniques, often simplified. Some performance tradeoffs are different, e.g. the significance of using an intermediate form.

Interpreter State
Hold the complete source state in the interpreter's data memory.
[Figure: interpreter data memory holding the source code, data, stack, program counter, condition codes, and registers 0..n-1, alongside the interpreter code]

Decode-Dispatch Interpretation

while (!halt && !interrupt) {
    inst = code[PC];
    opcode = extract(inst, 31, 6);
    switch (opcode) {
    case LoadWordAndZero: LoadWordAndZero(inst); break;
    case ALU:             ALU(inst);             break;
    case Branch:          Branch(inst);          break;
    . . .
    }
}

Each case dispatches to a routine in the instruction function list.
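A minimal runnable sketch of the decode-dispatch style in C. The toy 3-opcode ISA, its field layout, and all names here are invented for illustration; they are not the PowerPC encoding used on the slides.

```c
#include <stdint.h>
#include <assert.h>

/* Toy ISA for illustration: 8-bit opcode in bits 31..24, two 4-bit
   register fields, 8-bit immediate in bits 7..0. */
enum { OP_HALT = 0, OP_LOADI = 1, OP_ADD = 2 };

typedef struct {
    uint32_t pc;
    int32_t  regs[16];
    int      halt;
} State;

/* extract(inst, hi, len): the len bits ending at bit position hi. */
static uint32_t extract(uint32_t inst, int hi, int len) {
    return (inst >> (hi - len + 1)) & ((1u << len) - 1);
}

static void do_loadi(State *s, uint32_t inst) {
    s->regs[extract(inst, 23, 4)] = (int32_t)extract(inst, 7, 8);
    s->pc++;
}

static void do_add(State *s, uint32_t inst) {
    uint32_t rt = extract(inst, 23, 4), ra = extract(inst, 19, 4);
    uint32_t rb = extract(inst, 15, 4);
    s->regs[rt] = s->regs[ra] + s->regs[rb];
    s->pc++;
}

/* The decode-dispatch loop: fetch, extract the opcode, switch on it,
   call the instruction function. */
void interpret(State *s, const uint32_t *code) {
    while (!s->halt) {
        uint32_t inst = code[s->pc];
        switch (extract(inst, 31, 8)) {
        case OP_LOADI: do_loadi(s, inst); break;
        case OP_ADD:   do_add(s, inst);   break;
        case OP_HALT:  s->halt = 1;       break;
        }
    }
}
```

Even in this tiny form the per-instruction cost is visible: a load, shift/mask steps, a hard-to-predict indirect jump (the switch), a call, and a return.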

Instruction Functions: Load

LoadWordAndZero(inst) {
    RT = extract(inst, 25, 5);
    RA = extract(inst, 20, 5);
    displacement = extract(inst, 15, 16);
    if (RA == 0) source = 0;
    else source = regs[RA];
    address = source + displacement;
    regs[RT] = (data[address] << 32) >> 32;
    PC = PC + 4;
}

Instruction Functions: ALU

ALU(inst) {
    RT = extract(inst, 25, 5);
    RA = extract(inst, 20, 5);
    RB = extract(inst, 15, 5);
    source1 = regs[RA];
    source2 = regs[RB];
    extended_opcode = extract(inst, 10, 10);
    switch (extended_opcode) {
    case Add:         Add(inst);         break;
    case AddCarrying: AddCarrying(inst); break;
    case AddExtended: AddExtended(inst); break;
    . . .
    }
    PC = PC + 4;
}

Decode-Dispatch: Efficiency
The decode-dispatch loop is mostly serial code: a case statement (a hard-to-predict indirect jump), a call to the function routine, and a return.
Executing an add instruction takes approximately 20 target instructions, including several loads/stores and several shift/mask steps.
Hand-coding can lead to better performance; example: DEC/Compaq FX!32.

Threaded Interpretation
Remove the decode-dispatch jump inefficiency by getting rid of the decode-dispatch loop: copy the decode-dispatch code into the function routines themselves, so that the source code "threads" the function routines together.

Threaded Interpretation Example

LoadWordAndZero:
    RT = extract(inst, 25, 5);
    RA = extract(inst, 20, 5);
    displacement = extract(inst, 15, 16);
    if (RA == 0) source = 0;
    else source = regs[RA];
    address = source + displacement;
    regs[RT] = (data[address] << 32) >> 32;
    PC = PC + 4;
    if (halt || interrupt) goto exit;
    inst = code[PC];
    opcode = extract(inst, 31, 6);
    extended_opcode = extract(inst, 10, 10);
    routine = dispatch[opcode, extended_opcode];
    goto *routine;

Threaded Interpretation Example

Add:
    RT = extract(inst, 25, 5);
    RA = extract(inst, 20, 5);
    RB = extract(inst, 15, 5);
    source1 = regs[RA];
    source2 = regs[RB];
    sum = source1 + source2;
    regs[RT] = sum;
    PC = PC + 4;
    if (halt || interrupt) goto exit;
    inst = code[PC];
    opcode = extract(inst, 31, 6);
    extended_opcode = extract(inst, 10, 10);
    routine = dispatch[opcode, extended_opcode];
    goto *routine;

Comparison
[Figure: decode-dispatch interpreter — source code accessed as data through a central dispatch loop — vs. threaded interpreter, where the routines jump directly to each other]

Optimization: Pre-Decoding
Scan the static instructions and convert them to an internal pre-decoded form: separate the opcode, operands, etc. Perhaps put them into an intermediate form (stack form, or RISC-like micro-ops). This reduces shift/mask work significantly.

Example source sequence:
    lwz r1, 8(r2)
    add r3, r3, r1
    stw r3, 0(r4)
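One possible pre-decoded intermediate form, sketched in C. The struct layout, field names, and internal opcodes are illustrative assumptions, not taken from the book; the point is that each field is separated once at scan time, so the interpreter never re-shifts and re-masks the raw 32-bit word.

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical pre-decoded instruction record. */
typedef struct {
    uint8_t op;           /* dense internal opcode */
    uint8_t dest;         /* destination register (or source reg for stores) */
    uint8_t src1;         /* first source / base register */
    uint8_t src2;         /* second source register, if any */
    int16_t disp;         /* sign-extended displacement */
} PreDecoded;

enum { IOP_LWZ, IOP_ADD, IOP_STW };

/* The slide's sequence "lwz r1,8(r2); add r3,r3,r1; stw r3,0(r4)"
   becomes three fully separated records: */
static const PreDecoded example[] = {
    { IOP_LWZ, 1, 2, 0, 8 },   /* lwz r1, 8(r2)  */
    { IOP_ADD, 3, 3, 1, 0 },   /* add r3, r3, r1 */
    { IOP_STW, 3, 4, 0, 0 },   /* stw r3, 0(r4)  */
};
```

An interpreter over this form reads fields directly (e.g. `example[i].dest`) instead of calling extract() several times per instruction.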

Direct Threaded Interpretation
Eliminate the indirection through the dispatch table: replace opcodes with the PCs of the interpreter routines. The pre-decoded code then becomes dependent on the locations of the interpreter routines, which limits the portability of the interpreter.

Direct Threaded Interpretation

LoadWordAndZero:
    RT = code[TPC].dest;
    RA = code[TPC].src1;
    displacement = code[TPC].src2;
    if (RA == 0) source = 0;
    else source = regs[RA];
    address = source + displacement;
    regs[RT] = (data[address] << 32) >> 32;
    SPC = SPC + 4;
    TPC = TPC + 1;
    if (halt || interrupt) goto exit;
    routine = code[TPC].op;
    goto *routine;
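A compilable sketch of direct threading using the GNU C "labels as values" extension (computed goto), which is what `goto *routine` on the slide corresponds to. This is a sketch under stated assumptions: the toy instruction set, the `Instr` layout, and the pre-decode-on-entry pass are all invented here, and the code requires GCC or Clang (it will not build with compilers lacking the extension).

```c
#include <stdint.h>
#include <assert.h>

enum { I_LOADI, I_ADDI, I_HALT };

typedef struct {
    int     op;       /* opcode on entry */
    void   *handler;  /* filled in with the handler's label address */
    int     dest, src1;
    int32_t imm;
} Instr;

int32_t run(Instr *code, int n) {
    /* GNU C: &&label yields the address of a label; a static table of
       label addresses maps opcodes to handlers. */
    static void *table[] = { &&loadi, &&addi, &&halt };
    int32_t regs[16] = {0};

    /* Pre-decode pass: replace each opcode with its handler address,
       so steady-state dispatch is one indirect goto, no table lookup. */
    for (int i = 0; i < n; i++)
        code[i].handler = table[code[i].op];

    Instr *tpc = code;       /* the "TPC" of the slide */
    goto *tpc->handler;      /* initial dispatch */

loadi:
    regs[tpc->dest] = tpc->imm;
    tpc++; goto *tpc->handler;
addi:
    regs[tpc->dest] = regs[tpc->src1] + tpc->imm;
    tpc++; goto *tpc->handler;
halt:
    return regs[0];
}
```

The portability limitation on the slide is visible here: the pre-decoded code stores absolute handler addresses, so it is tied to this build of this interpreter.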

Direct Threaded Interpretation
[Figure: a pre-decoder converts source code into intermediate code whose entries point directly at interpreter routines]

Interpreting x86
CISC ISA instruction format: Prefixes (0 to 4) | Opcode | ModR/M (optional) | SIB (optional) | Displacement (0, 1, 2, or 4 bytes) | Immediate (0, 1, 2, or 4 bytes)

Interpreting x86
Maintain a general-purpose instruction template; parse the instruction and fill in the template, then dispatch and execute.
[Figure: a general decode step fills in the instruction structure, then dispatches to one of n specialized routines, one per instruction]

x86 Instruction Template (after Bochs)

struct IA-32instr {
    unsigned short opcode;
    unsigned short prefixmask;
    char ilen;                            // instruction length
    InterpreterFunctionPointer execute;   // semantic routine
    struct {  // general address computation: [Rbase + (Rindex << shmt) + displacement]
        char mode;          // addressing mode, including register operand
        char Rbase;         // base address register
        char Rindex;        // index register
        char shmt;          // index scale factor
        long displacement;
    } operandRM;
    struct {
        char mode;          // either register or immediate
        char regname;       // register number
        long immediate;     // immediate value
    } operandRI;
} instr;

Add Instruction

// ADD: register + Reg/Mem --> register
ADD_RX_32b(IA-32instr instr)   // X means: either Reg or Mem
{
    unsigned op1_32, op2_32, sum_32;
    op1_32 = IA-32_GPR[instr.operandRI.regname];
    if (mem_operand(instr.operandRM.mode)) {
        unsigned mem_addr = resolve_mem_address(instr);
        op2_32 = virtual_mem(instr.segment, mem_addr);
    } else {
        op2_32 = IA-32_GPR[instr.operandRM.Rbase];
    }
    sum_32 = op1_32 + op2_32;
    IA-32_GPR[instr.operandRI.regname] = sum_32;
    SET_IA-32_CC_FLAGS(op1_32, op2_32, sum_32, IA-32_INSTR_ADD32);
}

Faster x86 Interpreter
Make the common cases fast: dispatch on the first byte.
[Figure: dispatch on the first byte to specialized routines for simple instructions 1..m and complex instructions m+1..n, with shared routines for prefix handling and flag setting]

Threaded x86 Interpreter
Thread the common instructions; decode/dispatch the uncommon ones.
[Figure: simple instructions end with their own simple decode/dispatch code; complex instructions go back through a central complex decode/dispatch]

FX!32 Interpreter: Inner Loop
Software pipelining overlaps the dispatch-table lookup with execution of the previous instruction. Note the pressure on the D-cache while interpreting.
- Uses a 64K-entry dispatch table indexed by opcode
- Keeps the next 8 bytes of x86 code in registers: instruction I plus part of instruction I+1

Loop:
    Use length(inst. I) to access the first two bytes of inst. I+1
    Table lookup with the first two bytes of I+1:
        get the pointer to the interpreter routine
        get length(inst. I+1)
    Initiate fetch of additional instruction bytes for I+1 and the beginning of I+2
    Dispatch to execute inst. I
    Repeat Loop

Binary Translation
Generate "custom" code for every source instruction; get rid of instruction "parsing" and jumps altogether.

Binary Translation Example

x86 source binary:
    addl %edx, 4(%eax)
    movl 4(%eax), %edx
    add  %eax, 4

Translate to PowerPC target:
- r1 points to the x86 register context block
- r2 points to the x86 memory image
- r3 contains the x86 ISA PC value

Binary Translation Example

    lwz  r4,0(r1)     ;load %eax from register block
    addi r5,r4,4      ;add 4 to %eax
    lwzx r5,r2,r5     ;load operand from memory
    lwz  r4,12(r1)    ;load %edx from register block
    add  r5,r4,r5     ;perform add
    stw  r5,12(r1)    ;put result into %edx
    addi r3,r3,3      ;update PC (3 bytes)

    stwx r4,r2,r5     ;store %edx value into memory

    addi r4,r4,4      ;add immediate
    stw  r4,0(r1)     ;place result back into %eax

Binary Translation
[Figure: a binary translator converts the source code into binary translated target code]

Optimization: Register Mapping
Map source registers to target registers; keep a copy of the register state in target memory and spill registers when needed by the emulator. This reduces loads/stores significantly. Mapping is easier if #target registers > #source registers; if there are not enough target registers, register mapping may be done on a per-block basis.

Register Mapping Example
- r1 points to the x86 register context block
- r2 points to the x86 memory image
- r3 contains the x86 ISA PC value
- r4 holds x86 register %eax
- r7 holds x86 register %edx
- etc.

    addi r16,r4,4     ;add 4 to %eax
    lwzx r17,r2,r16   ;load operand from memory
    add  r7,r17,r7    ;perform add of %edx
    stwx r7,r2,r16    ;store %edx value into memory
    addi r4,r4,4      ;increment %eax
    addi r3,r3,9      ;update PC (9 bytes)

The Code Discovery Problem
In order to translate, the emulator must be able to "discover" code. Easier said than done. Consider this x86 byte stream:

    31 c0 8b b5 00 00 03 08 8b bd 00 00 03 00

Depending on where decoding starts, the same bytes can decode as mov %ch,0 followed by one instruction stream, or as movl %esi, 0x08030000(%ebp) followed by a different one.

The Code Discovery Problem
There are a number of contributors to the code discovery problem:
[Figure: a source instruction stream containing data embedded among instructions, a jump through register data, padding inserted for instruction alignment after an unconditional branch, and an indirect jump to an unknown destination]

The Code Location Problem
Source and target binaries use different PCs. On an indirect jump, the PC value is a source PC, but the jump should go to a target PC.

x86 source code:
    movl %eax, 4(%esp)   ;load jump address from memory
    jmp  %eax            ;jump indirect through %eax

PowerPC target code:
    addi  r16,r11,4   ;compute x86 address
    lwzx  r4,r2,r16   ;get x86 jump address from x86 memory image
    mtctr r4          ;move to count register
    bctr              ;jump indirect through ctr

Simplified Special Cases
Fixed mapping: all instructions are of fixed length on known boundaries, as in many RISCs. Data and pads may get translated, but that's OK: data will come from the original source locations (covered later).
Designed in: no jumps or branches to arbitrary locations; all jumps are via procedure (method) call; no data or pads are mixed with instructions. Java has these properties: all code in a method can be "discovered" when the method is entered.

Dynamic Translation
First interpret, and perform code discovery as a byproduct. Translate code incrementally, as it is discovered: place translated blocks into code memory and save the source-to-target PC mappings in a lookup table.
Emulation process: execute the translated block to its end, then look up the next source PC in the table. If it has been translated, jump to the target PC; else interpret and translate.
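The translate-or-reuse loop above can be sketched in C. This is a toy under stated assumptions: translated blocks are modeled as ordinary C functions that return the next source PC, the SPC-to-TPC map is a linear table rather than a hash table, the translator is passed in as a callback, and SPC 0 is used as a halt marker — none of these details come from the slides.

```c
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

#define MAP_SIZE 1024

/* A "translated block": runs to its end, returns the next source PC. */
typedef uint32_t (*TransBlock)(void);

static struct { uint32_t spc; TransBlock tpc; } map[MAP_SIZE];
static size_t map_len;

/* SPC -> TPC lookup (a real EM would hash; see the later slides). */
static TransBlock lookup(uint32_t spc) {
    for (size_t i = 0; i < map_len; i++)
        if (map[i].spc == spc) return map[i].tpc;
    return NULL;
}

void emulate(uint32_t start_spc, TransBlock (*translate)(uint32_t)) {
    uint32_t spc = start_spc;
    while (spc != 0) {               /* SPC 0 = halt, for this sketch */
        TransBlock b = lookup(spc);
        if (!b) {                    /* miss: translate incrementally */
            b = translate(spc);
            map[map_len].spc = spc;
            map[map_len].tpc = b;
            map_len++;
        }
        spc = b();                   /* run block to end, get next SPC */
    }
}

/* Demo: two fake "translated blocks" and a translate stub that counts
   invocations — each SPC should be translated only once. */
static int translate_count;
static uint32_t block_a(void) { return 200; }
static uint32_t block_b(void) { return 0; }
static TransBlock demo_translate(uint32_t spc) {
    translate_count++;
    return spc == 100 ? block_a : block_b;
}
```

Running `emulate(100, demo_translate)` twice translates each block once and reuses the cached translation on the second run.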

Dynamic Translation
[Figure: the emulation manager consults an SPC-to-TPC lookup table; a hit jumps into translation memory, a miss sends the source binary to the translator or interpreter]

Dynamic Blocks
Initially, dynamic basic blocks; later, larger blocks containing branches.
[Figure: the same static code partitioned two ways — static basic blocks, which end at every branch and branch target, vs. dynamic basic blocks, each starting at the instruction actually executed after a branch and running to the next branch, so that code such as a loop body can appear in more than one dynamic block]

Flow of Control
Control flows between the translated blocks and the emulation manager.
[Figure: execution alternates between translation blocks and the emulation manager]

Tracking the Source PC
The SPC can always be updated as part of the translated code, but it is better to place the SPC in a stub.
General method: the translator returns to the EM via a branch-and-link (BL); the next source PC is placed in a stub immediately after the BL. The EM can then use the link register to find the source PC and hash it to locate the next target code block.
[Figure: a code block ends in a branch-and-link to the EM with the next source PC in a stub; the EM uses a hash table to find the next block]

Emulation Manager Flowchart
[Figure: flowchart of the emulation manager's lookup / translate / dispatch cycle]

Example

x86 binary:
    4FD0: addl %edx,(%eax)   ;load and accumulate sum
          movl (%eax),%edx   ;store to memory
          sub  %ebx,1        ;decrement loop count
          jz   51C8          ;branch if at loop end
    4FDC: add  %eax,4        ;increment %eax
          jmp  4FD0          ;jump to loop top
    ...
    51C8: movl (%ecx),%edx   ;store last value of %edx
          xorl %edx,%edx     ;clear %edx
          jmp  6200          ;jump elsewhere

Example

PowerPC translation:
    9AC0: lwz    r16,0(r4)   ;load value from memory
          add    r7,r7,r16   ;accumulate sum
          stw    r7,0(r5)    ;store to memory
          addic. r5,r5,-1    ;decrement loop count, set cr0
          beq    cr0,pc+12   ;branch if loop exit
          bl     F000        ;branch & link to EM
          4FDC               ;save source PC in link register
    9AE4: bl     F000        ;branch & link to EM
          51C8               ;save source PC in link register
    ...
    9C08: stw    r7,0(r6)    ;store last value of %edx
          xor    r7,r7,r7    ;clear %edx
          bl     F000        ;branch & link to EM
          6200               ;save source PC in link register

Steps:
1. Translated basic block is executed.
2. Branch is taken to the stub.
3. Stub does a BAL to the emulation manager.
4. EM loads the SPC from the stub, using the link register.
5. EM hashes the SPC and does a lookup.
6. EM loads the SPC from the hash table and compares.
7. Load the TPC from the hash table (e.g. SPC 51C8 → TPC 9C08).
8. Branch to transfer code.
9. Jump indirect to the next translated block.
10. Continue execution.

Translation Chaining
Jump from one translation directly to the next, avoiding the switch back to the EM.
[Figure: without chaining, every translation block returns to the VMM; with chaining, blocks branch directly to their successors]

Creating a Link
[Figure: the predecessor block ends in a JAL to the translation manager with the next SPC in a stub; the TM looks up the successor's TPC and overwrites the JAL with a direct jump, chaining the predecessor to the successor]


Translation Chaining
If the translation-terminating branch is unconditional: jump directly to the next translation. If conditional: link each exit separately.
Indirect jumps: the simple solution is to always return to the EM; dynamic chaining uses software jump prediction instead.

Software Jump Prediction
A form of "inline caching". Say Rx holds the source branch address; addr_i are the predicted addresses (in probability order, determined via profiling) and target_i are the corresponding target code blocks.

    if      (Rx == addr_1) goto target_1;
    else if (Rx == addr_2) goto target_2;
    else if (Rx == addr_3) goto target_3;
    else hash_lookup(Rx);   /* do it the slow way */

Call/Returns
Calls and returns use the source stack for PC save/restore, and the stack contents can be changed between the call and the return. This causes a problem with returns: the return PC on the stack is a source PC, but a target PC is needed. Do a hash-table lookup on every return? Solution: use a shadow stack.

Shadow Stack
Integrate into the target stack; maintain with translated code.
- Calling routine: allocates a shadow stack frame; saves the source return address and the source stack pointer.
- Called routine: saves the target return address in the caller's shadow stack frame.
- On return: the called routine checks the source return address and stack pointer against the saved versions in the shadow frame. If there is no change, use the target return address; else return to the EM and apply the source return address to the hash table.
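The push/check protocol above can be sketched in C. This is a simplified model under stated assumptions: the shadow stack is a separate global array rather than being integrated into the target stack, and the function names and fixed depth are invented for illustration.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* One shadow frame per emulated call. */
typedef struct {
    uint32_t src_ret;   /* source (e.g. x86) return address */
    uint32_t src_sp;    /* source stack pointer at call time */
    void    *tgt_ret;   /* target (translated-code) return address */
} ShadowFrame;

#define SHADOW_MAX 256
static ShadowFrame shadow[SHADOW_MAX];
static int shadow_top;

/* Emitted at a translated call site: remember both return addresses. */
void shadow_push(uint32_t src_ret, uint32_t src_sp, void *tgt_ret) {
    shadow[shadow_top++] = (ShadowFrame){ src_ret, src_sp, tgt_ret };
}

/* Emitted at a translated return: if the source return address and
   stack pointer are unchanged, the saved target address is valid.
   NULL means the source stack was tampered with, so the EM must fall
   back to its SPC-to-TPC hash table. */
void *shadow_pop(uint32_t src_ret, uint32_t src_sp) {
    ShadowFrame f = shadow[--shadow_top];
    if (f.src_ret == src_ret && f.src_sp == src_sp)
        return f.tgt_ret;
    return NULL;
}
```

The fast path (a match) avoids a hash-table lookup on every return; the slow path only triggers when the emulated program actually modified its return address or stack pointer.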

Shadow Stack Example
[Figure: on a call, the x86 return address and x86 stack pointer are copied into the shadow stack frame along with the PPC return address; on return, the x86 values are compared against the stack, and the PPC return address is used if the compare succeeds]

Other Important Issues
To be covered later: self-modifying code, self-referencing code, precise traps.

Source/Target ISA Issues
- Register architectures
- Condition codes
- Data formats and operations: floating point, decimal, MMX
- Address resolution: byte vs. word addressing
- Address alignment: natural vs. arbitrary
- Byte order: big/little endian

Registers
- Local allocation: assign target registers to source registers for one translation block. Best case: each target register is read/written only once per translation.
- Global allocation: assign target registers permanently, e.g. for the PC or stack pointer.
- Regional allocation: register assignment is fixed for a region of translations; check to see which assignment is in effect, and write registers back when the region changes.
If #source registers << #target registers, assign all registers permanently; the converse means many saves/restores.

Condition Codes
- Both source and target have CCs: target CCs may need to be saved/restored frequently, e.g. if they are implicitly set. If the CC semantics differ, it may be simpler to save the input values and the operation, then generate the CCs only when actually needed.
- Source has no CCs; target does: the target emulates compare/branch; some comparisons may require multiple compares/branches.
- Source has CCs; target does not: the target must emulate the CCs each time they are needed, or save the input values and perform a lazy compare/branch when needed.
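The "save the inputs, evaluate later" idea can be sketched in C. This is a minimal model of lazy condition-code evaluation under invented names: a single global record stands in for the slide's R25/R26/R27 triple, and only the Zero flag is materialized (the role of genZF on the next slides).

```c
#include <stdint.h>
#include <assert.h>

enum CCOp { CC_ADD, CC_SUB };

/* Last flag-setting operation: operands plus the operation itself,
   recorded instead of eagerly computing the flags. */
static struct { uint32_t op1, op2; enum CCOp op; } cc;

/* Emitted for every flag-setting add: record inputs, return the sum.
   No flag computation happens here. */
static inline uint32_t emit_add(uint32_t a, uint32_t b) {
    cc.op1 = a; cc.op2 = b; cc.op = CC_ADD;
    return a + b;
}

/* Called only when a conditional branch actually reads the Zero flag:
   replay the recorded operation and test the result. */
int lazy_zf(void) {
    uint32_t r = (cc.op == CC_ADD) ? cc.op1 + cc.op2
                                   : cc.op1 - cc.op2;
    return r == 0;
}
```

If most flag-setting instructions are never followed by a flag-reading branch, the replay cost is paid rarely while the common path stays short.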

Lazy Condition Code Evaluation

x86 source:
    add %ecx,%ebx
    jmp label1
    . . .
label1:
    jz  target

PowerPC-to-x86 register mappings:
    R4  -> eax
    R5  -> ebx
    R6  -> ecx
    R24 -> scratch register used by emulation code
    R25 -> condition code operand 1  ;registers
    R26 -> condition code operand 2  ;used for
    R27 -> condition code operation  ;lazy condition code emulation
    R28 -> jump table base address

Lazy Condition Code Evaluation

    mr   r25,r6       ;save operands
    mr   r26,r5       ; and opcode for
    li   r27,"add"    ; lazy condition code emulation
    add  r6,r6,r5     ;translation of add
    b    label1
    ...
label1:
    bl   genZF        ;branch and link to genZF code
    beq  cr0,target   ;branch on condition flag

genZF:
    add   r29,r28,r27 ;add "opcode" to jump table base
    mtctr r29         ;copy to counter register
    bctr              ;branch via jump table
    ...
"add":
    add.  r24,r25,r26 ;perform PowerPC add, set cr0
    blr               ;return

Data Formats
- Different floating-point formats: convert source values to target values, and re-convert prior to stores and at translation boundaries. Results will not be identical; for identical results, detailed emulation is required (very expensive).
- Same floating-point formats: no conversion problem; if IEEE compliant, no problem with the results either.
- Multimedia (MMX): usually simple data types; per-instruction pre-translated libraries may be the best solution.

Address Resolution
Byte addressing vs. word addressing (not many word-addressed machines are left).
- Bytes on a word machine: lots of shifting and masking.
- Words on a byte machine: the low-order address bits are maintained at zero. There may be address space problems, since the byte machine has a smaller total address space.

Address Alignment
Some ISAs allow addresses aligned on arbitrary byte boundaries ("unaligned"); others only allow "natural" alignment. An unaligned source on a naturally-aligned target has high overhead: multiple loads/stores, shifts, and logicals. Test the alignment first, and only perform the multiple operations if the address is actually unaligned; one can generate two translation bodies, check alignment, then branch to the proper body. Many aligned machines already have provisions for unaligned data.
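The test-then-branch idea can be sketched in C. A sketch under stated assumptions: the function name is invented, the fast path models a natural aligned load, and `memcpy` stands in for the multiple byte loads and shifts a translator would actually emit on the slow path.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Emulate a possibly-unaligned 32-bit load on a target that only
   supports natural alignment: check the low address bits first and
   take the single-instruction path when the address is aligned. */
uint32_t load32(const uint8_t *base, uint32_t addr) {
    if ((addr & 3) == 0)                         /* naturally aligned: fast path */
        return *(const uint32_t *)(base + addr);
    uint32_t v;                                  /* unaligned: assemble from bytes */
    memcpy(&v, base + addr, 4);                  /* stands in for shifts/ORs */
    return v;
}
```

Generating two translation bodies, as the slide suggests, amounts to hoisting this `if` to translation time when the alignment is provably constant.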

Byte Order
If the source and target byte orders differ, the target code must perform byte reordering, which can be very inefficient: 8-10 target instructions per host load. Some target ISAs support both byte orders, e.g. MIPS and IA-64.
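The reordering itself is a shift-and-mask sequence of the kind the per-load instruction count above refers to. A minimal sketch (the function name is ours; many compilers recognize this pattern and emit a single byte-swap instruction, but a translator without such an instruction pays for every shift and OR):

```c
#include <stdint.h>
#include <assert.h>

/* Reverse the four bytes of a 32-bit word, as needed when emulating
   a big-endian load on a little-endian host or vice versa. */
uint32_t bswap32(uint32_t x) {
    return  (x >> 24)                   /* byte 3 -> byte 0 */
         | ((x >> 8)  & 0x0000ff00u)    /* byte 2 -> byte 1 */
         | ((x << 8)  & 0x00ff0000u)    /* byte 1 -> byte 2 */
         |  (x << 24);                  /* byte 0 -> byte 3 */
}
```

Applying this after every emulated load (and before every store) is what makes cross-endian emulation so much more expensive than same-endian emulation.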

Case Study: Sun Shade
A fast instruction set simulator, and a good case study for understanding the important issues. Shade uses binary translation for speed, which is uncommon in simulators (typically interpretation is used). The paper contains an interesting discussion of simulation; we focus on the binary translation aspect.

Shade Code Structures
[Figure: simulator address space containing text, data, stack, the simulated memory (VMEM), the TLB, and the TC]
- TC: translation cache
- TLB: translation lookaside buffer (a hash table, not the hardware structure)
- vpc: virtual (source) PC, held in a register

Shade Translations
Translation units are basic blocks or superblocks; some source code fragments may appear in more than one translation block. Translation stops at special instructions: synchronization, indirect branches, software traps. There is a limit on translation block size, to simplify storage allocation. Translation cache size: 4MB.

Shade TLB (Hash Table)
Maps a source PC to its cached translation (an unfortunate, confusing name). An array of lists (e.g. n-way set associative); each list is n <source, target> pairs. TLB size: 256KB (32K entries).

Shade Main Simulation Loop
- Hash vpc (the source PC) to a TLB array index.
- Fast partial search of the TLB entry for the vpc value; on a hit, use the target translation.
- On a miss, call a function to do a full search of the TLB entries: on a hit, move the hit entry to the front of the hash list; on a miss, generate the translation and place it at the front of the list.
Finding the successor block: the code block places the next source PC in vpc, then jumps to the VMM.

Shade Chaining Setup
- Successor translated first: the predecessor is translated with the branch to the chain pointer built in; the translator can use the hash table to find the chain pointer.
- Predecessor translated first: the block finishes with a BAL to the VMM, so the link register contains the PC of the BAL instruction (the slot where a chaining branch should go). When the successor is translated, the translator overwrites the BAL with a branch to the successor.
- Indirect jump chaining: not used; always return to the VMM.
[Figure: the predecessor's BAL to the VMM and its next-source-PC stub are overwritten with a direct branch to the successor's TPC]

Emulation Summary: Decode/Dispatch Interpretation
- Memory requirements: low
- Startup: fast
- Steady-state performance: slow
- Portability: excellent
[Figure: source code accessed as data by a central dispatch loop over interpreter routines]

Emulation Summary: Indirect Threaded Interpretation
- Memory requirements: low (slightly more than decode/dispatch due to replicated dispatch code)
- Startup: fast
- Steady-state performance: slow
- Portability: excellent
[Figure: interpreter routines threaded together through the source code]

Emulation Summary: Direct Threaded Interpretation
- Memory requirements: high
- Startup: slow
- Steady-state performance: medium
- Portability: medium (the pre-decoded version is location-specific)
[Figure: pre-decoder produces intermediate code whose entries point at interpreter routines]

Emulation Summary: Binary Translation
- Memory requirements: high
- Startup: very slow
- Steady-state performance: fast
- Portability: poor
[Figure: binary translator produces binary translated target code from the source code]