Interpretation 2: Dynamic Compilation
Vijay Janapa Reddi, The University of Texas at Austin



Class Objectives
- Perform a quick review of interpretation
- Discuss other forms of interpretation beyond decode-and-dispatch: threading, predecoding
- Identify the challenges of static versus dynamic interpretation
- Talk a bit about binary translation

Going from Abstract Bytecode to Native Code Generation
We have a machine description (e.g., the Java VM, Intel x86). How can we emulate its behavior?

Two Fundamental Approaches
- Interpretation (sometimes called emulation)
- Binary translation
- Hybrids of interpretation and binary translation

Interpreter State
An interpreter needs to maintain the complete architected state of the machine implementing the source ISA:
- Registers (Reg 0, Reg 1, ..., Reg n-1)
- Memory: code, data, stack
- Program counter
- Condition codes
(Figure: the interpreter's own code alongside the guest's code, data, stack, and register array.)

Decode-Dispatch Interpreter
A decode-and-dispatch interpreter steps through the program one instruction at a time:
- decode the current instruction
- dispatch to the corresponding interpreter routine
This has a very high interpretation cost.

    while (!halt && !interrupt) {
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        switch (opcode) {
        case LoadWordAndZero: LoadWordAndZero(inst); break;
        case ALU:             ALU(inst);             break;
        case Branch:          Branch(inst);          break;
        ...
        }
    }

Interpreter Behavior
1. Fetch the opcode
2. Go to the right switch case
3. Call the right routine
4. Emulate the bytecode
5. Return to the caller
6. Restart
The decode-dispatch loop is mostly serial code: a case statement, a call to a function routine, and a return.

Decode-Dispatch Efficiency
The decode-dispatch loop is mostly serial code with some problematic behavior: the case statement, the call to the function routine, and the return. Executing a single add instruction takes approximately 20 target instructions, including several loads/stores and shift/mask steps. Hand-coding the interpreter can lead to better performance, e.g., DEC/Compaq FX!32.

Threaded Interpretation
Introduced by James R. Bell, 1973.

    interpreter_add(...) {
        ...
        opcode = get_next_opcode();
        f = dispatch[opcode];
        call f
    }
    interpreter_sub(...) {
        ...   /* same pattern: do the work, then dispatch the next opcode */
    }

Where is the optimization?

Threaded Interpreter Behavior Can we optimize more?

Threaded Interpretation

    interpreter_add(...) {
        ...
        opcode = get_next_opcode();
        f = dispatch[opcode];
        call f
    }
    interpreter_sub(...) {
        ...
    }

Can we optimize more?

Indirect Threaded Interpretation
The high number of branches in decode-dispatch interpretation reduces performance: an overhead of about 5 branches per instruction. Threaded interpretation improves efficiency by reducing this branch overhead:
- Append the dispatch code to each interpretation routine
- Removes 3 of the branches
- Threads the function routines together

Indirect Threaded Interpretation (2)
Decode-dispatch version (a routine called from the central loop):

    LoadWordAndZero(inst) {
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        displacement = extract(inst, 15, 16);
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        PC = PC + 4;
    }

Indirect threaded version (the dispatch code is appended to the routine):

    LoadWordAndZero:
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        displacement = extract(inst, 15, 16);
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        PC = PC + 4;
        if (halt || interrupt) goto exit;
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        extended_opcode = extract(inst, 10, 10);
        routine = dispatch[opcode, extended_opcode];
        goto *routine;

Indirect Threaded Interpretation (3)

    Add:
        RT = extract(inst, 25, 5);
        RA = extract(inst, 20, 5);
        RB = extract(inst, 15, 5);
        source1 = regs[RA];
        source2 = regs[RB];
        sum = source1 + source2;
        regs[RT] = sum;
        PC = PC + 4;
        if (halt || interrupt) goto exit;
        inst = code[PC];
        opcode = extract(inst, 31, 6);
        extended_opcode = extract(inst, 10, 10);
        routine = dispatch[opcode, extended_opcode];
        goto *routine;

Indirect Threaded Interpretation (4)
Dispatch occurs indirectly through a table, so interpretation routines can be modified and relocated independently.
Advantages:
- The binary intermediate code remains portable
- Improves efficiency over basic interpretation
Disadvantages:
- Code replication increases interpreter size

Indirect Threaded Interpretation (5)
(Figure: decode-dispatch versus threaded control flow, showing the source code, the interpreter routines, the "data" accesses, and the dispatch loop.)

Predecoding
Parse each instruction into a predefined structure to facilitate interpretation:
- Separate out the opcode, operands, etc.
- Reduces shifts/masks significantly
What ISA is this more useful for?!
Note: changing the form of the input binary damages portability.
Example: the source sequence

    lwz r1, 8(r2)
    add r3, r3, r1
    stw r3, 0(r4)

predecodes into records such as

    07 (load word and zero)  1  2  08
    08 (add)                 3  03
    37 (store word)          4  00

Predecoding (2)
(SPC is the source program counter; TPC indexes the predecoded code.)

    struct instruction {
        unsigned long op;
        unsigned char dest, src1, src2;
    } code[CODE_SIZE];

    Load Word and Zero:
        RT = code[TPC].dest;
        RA = code[TPC].src1;
        displacement = code[TPC].src2;
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        SPC = SPC + 4;
        TPC = TPC + 1;
        if (halt || interrupt) goto exit;
        opcode = code[TPC].op;
        routine = dispatch[opcode];
        goto *routine;

Direct Threaded Interpretation
Allows even higher efficiency by removing the memory access to the dispatch table:
- Requires predecoding
- Depends on the locations of the interpreter routines (loses portability)
With predecoding, the opcode field is replaced by the address of the corresponding interpreter routine:

    before:                              after:
    07 (load word and zero)  1  2  08    001048d0 (load word and zero)  1  2  08
    08 (add)                 3  03       00104800 (add)                 3  03
    37 (store word)          4  00       00104910 (store word)          4  00

Direct Threaded Interpretation (2)
- Predecode the source binary into an intermediate structure
- Replace the opcode in the intermediate form with the address of the interpreter routine
- This removes the memory lookup of the dispatch table
- Limits portability, since the exact locations of the interpreter routines are needed

Direct Threaded Interpretation (3)

    Load Word and Zero:
        RT = code[TPC].dest;
        RA = code[TPC].src1;
        displacement = code[TPC].src2;
        if (RA == 0) source = 0;
        else source = regs[RA];
        address = source + displacement;
        regs[RT] = (data[address] << 32) >> 32;
        SPC = SPC + 4;
        TPC = TPC + 1;
        if (halt || interrupt) goto exit;
        routine = code[TPC].op;
        goto *routine;

Direct Threaded Interpretation (4)
(Figure: the predecoder transforms the source code into intermediate code, which points directly at the interpreter routines.)

Some Challenges
The complexity of the ISA can be an issue.
Easy: P-code, Java bytecodes, fixed-length instruction sets
Hard: variable-length instruction sets (opcode lengths vary, operand locations vary), and code mixed with data

x86 CISC Instruction Complexity

Predecoding the CISC ISA

    struct IA-32instr {
        unsigned short opcode;
        unsigned short prefixmask;
        char ilen;                           // instruction length
        InterpreterFunctionPointer execute;  // semantic routine for this instr.
        struct {
            // general address computation:
            //   [Rbase + (Rindex << shmt) + displacement]
            char mode;          // operand addressing mode, incl. register operand
            char Rbase;         // base address register
            char Rindex;        // index register
            char shmt;          // index scale factor
            long displacement;
        } operandRM;
        struct {
            char mode;          // either register or immediate
            char regname;       // register number
            long immediate;     // immediate value
        } operandRI;
    } instr;

Predecoding will lead to a very large program size!

Big Fetch-Decode Table

    IA-32OpcodeInfo_t IA-32_fetch_decode_table[] = {
        { DecodeAction, InterpreterFunctionPointer },
        { DecodeAction, InterpreterFunctionPointer },
        { DecodeAction, InterpreterFunctionPointer },
        ...
    };

Decode-Dispatch Loop for x86

Decode-Dispatch Loop for x86 (2)

Decode-Dispatch Loop for x86 (3)

Could Apply Traditional Interpreter Control Flow to x86
Decode for a CISC ISA: a general decode step fills in the instruction structure, then dispatch selects among individual specialized routines, one per instruction (Inst. 1 specialized routine, Inst. 2 specialized routine, ..., Inst. n specialized routine).

Or… We Could Make the Common Case Fast
For CISC ISAs:
- Multiple-byte opcode versus single-byte opcode
- Prefix versus no prefix

Make the Common Case Fast!

Optimizing x86 via Predecoded Direct Threaded Interpretation
(Figure: the predecoder transforms the source code into intermediate code, which points directly at the interpreter routines.)

Code Discovery Problem
It may be difficult to statically translate or predecode the entire source program. Consider the x86 byte stream

    31 c0 8b b5 00 00 03 08 8b bd 00 00 03 00

Starting at one offset, 8b b5 00 00 03 08 decodes as movl %esi, 0x08030000(%ebp) ?? — but starting one byte later, b5 00 decodes as mov %ch, 0 ?? Which decoding is correct?

Code Discovery Problem (2)
Contributors to the code discovery problem:
- variable-length (CISC) instructions
- indirect jumps (jump indirect to... where?)
- data interspersed with code
- padding instructions to align branch targets
(Figure: a source ISA instruction stream containing data in the instruction stream, padding for instruction alignment, and a register-indirect jump.)

Simplified Solutions
- Fixed-width RISC ISAs: instructions are always aligned on fixed boundaries
- Use special instruction sets (e.g., Java bytecode): no jumps/branches to arbitrary locations, no data or pads mixed with instructions, so all code can be discovered
- Use incremental approaches

Class Problem Generate a summary table that succinctly captures the pros and cons of the various schemes

Reading
"Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters," Kevin Casey, M. Anton Ertl (TU Wien), and David Gregg (ACM TOPLAS, 2005). Uploaded (ready for review today). Due at the time of your interpreter homework. 30 pages.