Vijay Janapa Reddi The University of Texas at Austin Interpretation 2

Dynamic Compilation

Class Objectives
- Perform a quick review of interpretation
- Discuss other forms of interpretation beyond decode and dispatch
  - Threading
  - Predecoding
- Identify the challenges of static versus dynamic interpretation
- Talk a bit about binary translation

Going from abstract byte code to native code generation
- We have a machine description (e.g. Java VM, Intel x86, etc.)
- How can we emulate its behavior?

Two Fundamental Approaches
- Interpretation (sometimes called emulation)
- Binary translation
- Hybrid of interpretation and binary translation

Interpreter State
An interpreter needs to maintain the complete architected state of the machine implementing the source ISA:
- Registers
- Memory
  - Code
  - Data
  - Stack
[Figure: the interpreter's view of source state: memory image (code, data, stack), program counter, condition codes, Reg 0 through Reg n-1, and the interpreter code.]

Decode-Dispatch Interpreter
A decode-and-dispatch interpreter:
- Steps through the program one instruction at a time
- Decodes the current instruction
- Dispatches to the corresponding interpreter routine
- Has a very high interpretation cost

    while (!halt && !interrupt) {
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        switch (opcode) {
            case LoadWordAndZero: LoadWordAndZero(inst); break;
            case ALU:             ALU(inst);             break;
            case Branch:          Branch(inst);          break;
            …
        }
    }
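
The decode-dispatch loop above can be sketched as a small, self-contained C fragment. The toy ISA used here (four 8-bit fields: op, a, b, c, with OP_LI, OP_ADD, and OP_HALT) and the names Machine, field, encode, and run_decode_dispatch are all invented for illustration; this is not the PowerPC-style encoding used on the slides.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical toy ISA for this sketch: op | a | b | c, 8 bits each. */
enum { OP_HALT = 0, OP_LI = 1, OP_ADD = 2 };

typedef struct {
    uint32_t regs[8];
    size_t   pc;    /* index into code[], not a byte address */
    int      halt;
} Machine;

/* Shift/mask step: pull one 8-bit field out of an instruction word. */
uint32_t field(uint32_t inst, unsigned shift) {
    return (inst >> shift) & 0xFFu;
}

/* Helper (assumed, for building test programs only). */
uint32_t encode(uint32_t op, uint32_t a, uint32_t b, uint32_t c) {
    return (op << 24) | (a << 16) | (b << 8) | c;
}

/* Interpreter routines: one per source instruction. */
void do_li(Machine *m, uint32_t inst) {
    m->regs[field(inst, 16) & 7] = field(inst, 0);
}

void do_add(Machine *m, uint32_t inst) {
    m->regs[field(inst, 16) & 7] =
        m->regs[field(inst, 8) & 7] + m->regs[field(inst, 0) & 7];
}

/* Central loop: fetch, decode the opcode, dispatch via switch, repeat. */
void run_decode_dispatch(Machine *m, const uint32_t *code, size_t n) {
    while (!m->halt && m->pc < n) {
        uint32_t inst = code[m->pc++];
        switch (field(inst, 24)) {
        case OP_LI:  do_li(m, inst);  break;
        case OP_ADD: do_add(m, inst); break;
        default:     m->halt = 1;     break;  /* OP_HALT or unknown opcode */
        }
    }
}
```

Every emulated instruction pays for the loop test, the switch, the call, and the return, which is the overhead the later slides try to remove.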

Interpreter Behavior
- Fetch the opcode
- Go to the right “switch case”
- Call the right routine
- Emulate the bytecode
- Return to the caller
- Restart
[Figure: the decode-dispatch loop is mostly serial code: case statement, call to function routine, return.]

Decode-Dispatch Efficiency
- The decode-dispatch loop is mostly serial code, with some “problematic” code behavior
  - Case statement
  - Call to function routine
  - Return
- Executing an add instruction takes
  - Approximately 20 target instructions
  - Several loads/stores and shift/mask steps
- Hand-coding can lead to better performance, e.g. DEC/Compaq FX!32

Threaded Interpretation
Introduced by James R. Bell, 1973

    interpreter_add(...) {
        …
        opcode = get_next_opcode()
        f = dispatch[opcode]
        call f
    }
    interpreter_sub(...) {
        …
    }

Where is the optimization?

Threaded Interpreter Behavior
Can we optimize more?

Indirect Threaded Interpretation
- The high number of branches in decode-dispatch interpretation reduces performance
  - Overhead of 5 branches per instruction
- Threaded interpretation improves efficiency by reducing branch overhead
  - Append dispatch code to each interpretation routine
  - Removes 3 branches
  - Threads together function routines

Indirect Threaded Interpretation (2)

Decode-dispatch routine:

    LoadWordAndZero(inst) {
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        displacement = extract(inst, 15, 16);
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        PC = PC + 4;
    }

Threaded routines, each ending with its own copy of the dispatch code:

    LoadWordAndZero:
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        displacement = extract(inst, 15, 16);
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        PC = PC + 4;
        if (halt || interrupt) goto exit;
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        extended_opcode = extract(inst, 10, 10);
        routine = dispatch[opcode, extended_opcode];
        goto *routine;

    Add:
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        RB = extract(inst, 15, 5);
        source1 = regs[RA];
        source2 = regs[RB];
        sum = source1 + source2;
        regs[RT] = sum;
        PC = PC + 4;
        if (halt || interrupt) goto exit;
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        extended_opcode = extract(inst, 10, 10);
        routine = dispatch[opcode, extended_opcode];
        goto *routine;

- Dispatch occurs indirectly through a table
  - Interpretation routines can be modified and relocated independently
- Advantages
  - Binary intermediate code still portable
  - Improves efficiency over basic interpretation
- Disadvantages
  - Code replication increases interpreter size
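
A minimal sketch of indirect threading in standard C follows. The slides use `goto *routine`, which relies on a compiler extension (labels as values); this sketch instead ends each routine with a call through the dispatch table, which shows the same structure but is only an approximation: C does not guarantee tail-call elimination, so the stack may grow with every emulated instruction. The toy ISA and all names (Machine, Routine, NEXT, run_threaded) are assumptions made for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical toy ISA for this sketch: op | a | b | c, 8 bits each. */
enum { OP_HALT = 0, OP_LI = 1, OP_ADD = 2, NUM_OPS = 3 };

typedef struct Machine Machine;
typedef void (*Routine)(Machine *);

struct Machine {
    uint32_t        regs[8];
    const uint32_t *code;   /* program must end with OP_HALT */
    size_t          pc;     /* index into code[] */
};

uint32_t field(uint32_t inst, unsigned shift) {
    return (inst >> shift) & 0xFFu;
}

uint32_t encode(uint32_t op, uint32_t a, uint32_t b, uint32_t c) {
    return (op << 24) | (a << 16) | (b << 8) | c;
}

void op_halt(Machine *m);
void op_li(Machine *m);
void op_add(Machine *m);

/* Dispatch table: opcode -> interpreter routine (the indirection). */
const Routine dispatch[NUM_OPS] = { op_halt, op_li, op_add };

/* Dispatch code replicated at the end of every routine; there is no
   central loop. Unknown opcodes are treated as OP_HALT. */
#define NEXT(m)                                              \
    do {                                                     \
        uint32_t op_ = field((m)->code[(m)->pc], 24);        \
        dispatch[op_ < NUM_OPS ? op_ : OP_HALT](m);          \
    } while (0)

void op_halt(Machine *m) {
    (void)m;  /* stop: do not dispatch again */
}

void op_li(Machine *m) {
    uint32_t inst = m->code[m->pc++];
    m->regs[field(inst, 16) & 7] = field(inst, 0);
    NEXT(m);
}

void op_add(Machine *m) {
    uint32_t inst = m->code[m->pc++];
    m->regs[field(inst, 16) & 7] =
        m->regs[field(inst, 8) & 7] + m->regs[field(inst, 0) & 7];
    NEXT(m);
}

void run_threaded(Machine *m) {
    NEXT(m);  /* the first dispatch starts the thread of routines */
}
```

Because routines are reached only through the table, they can be moved or rebuilt without touching the source binary, at the cost of one extra memory access per instruction.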

[Figure: decode-dispatch versus threaded interpretation, showing the source code, the "data" accesses made by the interpreter routines, and the dispatch loop.]

Predecoding
- Parse each instruction into a pre-defined structure to facilitate interpretation
  - Separate opcode, operands, etc.
  - Reduces shifts/masks significantly
  - Which ISAs is this more useful for?
- Changing the input binary damages portability
[Figure: the source sequence
    lwz r1, 8(r2)
    add r3, r3, r1
    stw r3, 0(r4)
and its predecoded records, as far as they survive in the transcript: load word and zero (1 2 08), add (3 03), store word (4 00).]

Predecoding (2)

    struct instruction {
        unsigned long op;
        unsigned char dest, src1, src2;
    } code[CODE_SIZE];

    LoadWordAndZero:
        RT = code[TPC].dest;
        RA = code[TPC].src1;
        displacement = code[TPC].src2;
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        SPC = SPC + 4;    /* source program counter */
        TPC = TPC + 1;    /* index into the predecoded code[] array */
        if (halt || interrupt) goto exit;
        opcode = code[TPC].op;
        routine = dispatch[opcode];
        goto *routine;
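
The predecoding step itself is not shown on the slide; one way it might look is sketched below. The packed source encoding (op | dest | src1 | src2, 8 bits each) and the names Predecoded, predecode, and run_predecoded are hypothetical. The point is that all shifting and masking happens once, up front, and the interpreter then reads ready-made fields.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical toy ISA for this sketch: op | dest | src1 | src2, 8 bits each. */
enum { OP_HALT = 0, OP_LI = 1, OP_ADD = 2 };

/* Intermediate form: one fixed-size record per source instruction. */
typedef struct {
    uint8_t op, dest, src1, src2;
} Predecoded;

uint32_t encode(uint32_t op, uint32_t a, uint32_t b, uint32_t c) {
    return (op << 24) | (a << 16) | (b << 8) | c;
}

/* One pass over the source binary: all shift/mask work is done here. */
void predecode(const uint32_t *src, size_t n, Predecoded *out) {
    for (size_t i = 0; i < n; i++) {
        out[i].op   = (uint8_t)(src[i] >> 24);
        out[i].dest = (uint8_t)(src[i] >> 16);
        out[i].src1 = (uint8_t)(src[i] >> 8);
        out[i].src2 = (uint8_t)src[i];
    }
}

/* Interpreter over the intermediate form: no field extraction needed.
   tpc plays the role of TPC on the slide (an index, stepping by 1). */
void run_predecoded(uint32_t regs[8], const Predecoded *code, size_t n) {
    for (size_t tpc = 0; tpc < n; tpc++) {
        const Predecoded *in = &code[tpc];
        switch (in->op) {
        case OP_LI:
            regs[in->dest & 7] = in->src2;
            break;
        case OP_ADD:
            regs[in->dest & 7] = regs[in->src1 & 7] + regs[in->src2 & 7];
            break;
        default:
            return;  /* OP_HALT or unknown opcode */
        }
    }
}
```

This sketch uses a plain switch for dispatch to keep the focus on predecoding; the slide combines predecoding with threaded dispatch.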

Direct Threaded Interpretation
- Allows even higher efficiency by removing the memory access to the dispatch table
- Requires predecoding
- Dependent on the locations of the interpreter routines (loses portability)
[Figure: the predecoded records for load word and zero (1 2 08), add (3 03), and store word (4 00), shown twice; in the second copy the opcode field holds an interpreter routine address such as 001048d0.]

Direct Threaded Interpretation (2)
- Predecode the source binary into an intermediate structure
- Replace the opcode in the intermediate form with the address of the interpreter routine
- Remove the memory lookup of the dispatch table
- Limits portability since the exact locations of the interpreter routines are needed

    LoadWordAndZero:
        RT = code[TPC].dest;
        RA = code[TPC].src1;
        displacement = code[TPC].src2;
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        SPC = SPC + 4;
        TPC = TPC + 1;
        if (halt || interrupt) goto exit;
        routine = code[TPC].op;    /* op now holds the routine address */
        goto *routine;
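
A sketch of direct threading in standard C, under the same caveats as any function-call rendering of `goto *routine`: calls stand in for computed gotos, and the stack may grow per instruction because C does not guarantee tail calls. The toy encoding and the names Instr, Routine, predecode, and run_direct_threaded are invented here. The opcode-to-routine table is consulted only while predecoding; at run time each record already carries the address of its routine.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical toy ISA for this sketch: op | dest | src1 | src2, 8 bits each. */
enum { OP_HALT = 0, OP_LI = 1, OP_ADD = 2, NUM_OPS = 3 };

typedef struct Machine Machine;
typedef struct Instr Instr;
typedef void (*Routine)(Machine *, const Instr *);

struct Instr {
    Routine routine;           /* replaces the opcode field */
    uint8_t dest, src1, src2;
};

struct Machine {
    uint32_t regs[8];
};

uint32_t encode(uint32_t op, uint32_t a, uint32_t b, uint32_t c) {
    return (op << 24) | (a << 16) | (b << 8) | c;
}

void op_halt(Machine *m, const Instr *ip) {
    (void)m; (void)ip;  /* stop: no further dispatch */
}

void op_li(Machine *m, const Instr *ip) {
    m->regs[ip->dest & 7] = ip->src2;
    ip[1].routine(m, ip + 1);  /* jump straight to the next routine */
}

void op_add(Machine *m, const Instr *ip) {
    m->regs[ip->dest & 7] = m->regs[ip->src1 & 7] + m->regs[ip->src2 & 7];
    ip[1].routine(m, ip + 1);
}

/* The table is used only here, once per instruction, at predecode time.
   The resulting code[] is tied to this interpreter's routine addresses,
   which is why portability is lost. The program must end with OP_HALT. */
void predecode(const uint32_t *src, size_t n, Instr *out) {
    static const Routine table[NUM_OPS] = { op_halt, op_li, op_add };
    for (size_t i = 0; i < n; i++) {
        uint32_t op = src[i] >> 24;
        out[i].routine = table[op < NUM_OPS ? op : OP_HALT];
        out[i].dest = (uint8_t)(src[i] >> 16);
        out[i].src1 = (uint8_t)(src[i] >> 8);
        out[i].src2 = (uint8_t)src[i];
    }
}

void run_direct_threaded(Machine *m, const Instr *code) {
    code[0].routine(m, code);
}
```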

[Figure: a predecoder turns the source code into intermediate code, which the interpreter routines then execute.]

Some Challenges
- Complexity of the ISA can be an issue
- Easy
  - P-code
  - Java bytecodes
  - Fixed-length instruction sets
- Hard
  - Variable-length instruction sets
  - Opcode lengths vary, operand locations vary
  - Code versus data

x86 CISC Instruction Complexity

Predecoding the CISC ISA

    struct IA32instr {
        unsigned short opcode;
        unsigned short prefixmask;
        char ilen;                            // instruction length
        InterpreterFunctionPointer execute;   // semantic routine for this instr.
        struct {
            // general address computation:
            // [Rbase + (Rindex << shmt) + displacement]
            char mode;          // operand addressing mode, including register operand
            char Rbase;         // base address register
            char Rindex;        // index register
            char shmt;          // index scale factor
            long displacement;
        } operandRM;
        struct {
            char mode;          // either register or immediate
            char regname;       // register number
            long immediate;     // immediate value
        } operandRI;
    } instr;

Predecoding will lead to a very large program size!

Big Fetch-Decode Table

    IA32OpcodeInfo_t IA32_fetch_decode_table[] = {
        { DecodeAction, InterpreterFunctionPointer },
        { DecodeAction, InterpreterFunctionPointer },
        { DecodeAction, InterpreterFunctionPointer },
        …
    };

Decode Dispatch Loop for x86

Decode Dispatch Loop for x86 (2)

Decode Dispatch Loop for x86 (3)

Could Apply Traditional Interpreter Control Flow to x86
- Decode for CISC ISA
- Individual routines for each instruction
[Figure: General Decode (fill in instruction structure) feeds Dispatch, which selects one specialized routine per instruction: Inst. 1, Inst. 2, ..., Inst. n.]

Or… We Could Make the Common Case Fast
For CISC ISAs:
- Multiple-byte opcode versus single-byte opcode
- Prefix versus no prefix

Make the Common Case Fast!

Optimizing x86 via Pre-Decoded Direct Threaded Interpretation
[Figure: a predecoder turns the source code into intermediate code, which the interpreter routines then execute.]

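Code Discovery Problem
- It may be difficult to statically translate or predecode the entire source program
- Consider x86 code
[Figure: an x86 byte sequence (31 c0 8b b b bd as it survives in the transcript; some bytes are lost) with two candidate decodings marked "??": mov %ch, 0 and movl %esi, 0x…(%ebp).]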

Code Discovery Problem (2)
Contributors to the code discovery problem:
- Variable-length (CISC) instructions
- Indirect jumps
- Data interspersed with code
- Padding instructions to align branch targets
[Figure: a source ISA instruction stream (inst. 1, inst. 2, inst. 3, jump reg., data, inst. 5, inst. 6, uncond. branch, pad, inst. 8) annotated with "data in instruction stream", "pad for instruction alignment", and "jump indirect to???".]
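
To make the contributors above concrete, here is a sketch of a static code-discovery pass over a made-up variable-length ISA. Everything here is hypothetical: the opcodes (T_HALT, T_NOP, T_LI, T_JMP, T_JMPR), their byte lengths, and the function names. The pass follows fall-through and direct jumps from an entry point and must stop at a register-indirect jump, because the target is not known until run time.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical variable-length toy ISA (first byte is the opcode):
   T_HALT (1 byte), T_NOP (1 byte), T_LI reg imm (3 bytes),
   T_JMP target (2 bytes, direct), T_JMPR reg (2 bytes, indirect). */
enum { T_HALT = 0, T_NOP = 1, T_LI = 2, T_JMP = 3, T_JMPR = 4 };

/* Instruction length in bytes; 0 means "not a valid opcode". */
size_t toy_length(uint8_t opcode) {
    switch (opcode) {
    case T_HALT: case T_NOP:  return 1;
    case T_JMP:  case T_JMPR: return 2;
    case T_LI:                return 3;
    default:                  return 0;
    }
}

/* Marks is_start[i] = 1 for each byte offset where an instruction
   reachable from `entry` begins, and returns how many were found.
   Discovery stops at a halt, an invalid byte, an already-visited
   offset, or an indirect jump. */
size_t discover(const uint8_t *mem, size_t n, size_t entry, uint8_t *is_start) {
    size_t found = 0;
    size_t pc = entry;
    memset(is_start, 0, n);
    while (pc < n && !is_start[pc]) {
        uint8_t op = mem[pc];
        size_t len = toy_length(op);
        if (len == 0 || pc + len > n)
            break;                    /* data or truncated instruction */
        is_start[pc] = 1;
        found++;
        if (op == T_HALT || op == T_JMPR)
            break;                    /* indirect target unknown statically */
        pc = (op == T_JMP) ? mem[pc + 1] : pc + len;
    }
    return found;
}
```

For the byte sequence { T_LI, 1, 6, T_JMPR, 1, 0xFF, T_NOP, T_HALT }, discovery from offset 0 finds only the two instructions at offsets 0 and 3. The code at offsets 6 and 7 would run if the jump register held 6, but a static pass cannot tell that, and the data byte at offset 5 prevents simply decoding straight through.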

Simplified Solutions
- Fixed-width RISC ISAs: instructions are always aligned on fixed boundaries
- Use special instruction sets (Java)
  - No jumps/branches to arbitrary locations
  - No data or pads mixed with instructions
  - All code can then be discovered
- Use incremental approaches

Class Problem
Generate a summary table that succinctly captures the pros and cons of the various schemes

Reading
- Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters
- Authors: Kevin Casey, M. Anton Ertl (TU Wien), and David Gregg (TOPLAS 2005)
- Uploaded (ready for review today)
- Due at the time of your interpreter homework
- 30 pages
