JUST-IN-TIME COMPILATION

JUST-IN-TIME COMPILATION Lessons Learned From Transmeta Thomas Kistler thomas.kistler@smachines.com

Industry Observation
- Shift away from proprietary closed platforms to open, architecture-independent platforms.
- Shift away from native code to portable code.
- All these platforms make heavy use of just-in-time compilation.

Industry Observation: Examples
- Android & Dalvik VM
- Chromium OS & PNaCl
- Java ME/SE/EE & Java VM
- HTML5 & JavaScript
- Microsoft .NET & CLR
- VMware

Part I: The Transmeta Architecture

Transmeta’s Premise
Superscalar out-of-order processors are complicated:
- Lots of transistors
- Increased power consumption
- Increased die area
- Increased cost
- Do not scale well

Transmeta’s Idea
Build a simple in-order VLIW processor with Code Morphing Software.
- More efficient in area, cost and power.
- Performance of an out-of-order architecture through software optimization.

VLIW Architecture
In superscalar architectures, the execution units are invisible to the instruction set. The instruction set is independent of the micro-architecture.
In VLIW architectures, the execution units are visible to the instruction set. A VLIW instruction encodes multiple operations; specifically, one operation for each execution unit. The instruction set is closely tied to the micro-architecture.
- No or limited hardware interlocks. The compiler is responsible for correct scheduling.
- No forward and backward compatibility.

Software Architecture
Flowchart for each piece of x86 code:
- Translated? Yes: execute the VLIW code from the code cache.
- Translated? No: check whether the code is hot.
- Hot? Yes: the just-in-time compiler produces VLIW code, which goes into the code cache.
- Hot? No: the interpreter executes the x86 code.
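
The flowchart can be sketched as a small dispatch routine. This is a minimal illustration, not Transmeta's implementation: the names `Dispatcher`, `run_block`, `HOT_THRESHOLD` and the per-block execution counter are all assumptions introduced here, and the "interpreter" and "translator" are stand-in callables.

```python
from typing import Callable, Dict, List

HOT_THRESHOLD = 3  # hypothetical trigger point, kept small for the demo


class Dispatcher:
    """Toy model of the translated?/hot? decision for one block of x86 code."""

    def __init__(self,
                 interpret: Callable[[int], str],
                 translate: Callable[[int], Callable[[], str]]) -> None:
        self.interpret = interpret    # slow path: executes the block once
        self.translate = translate    # JIT: returns a callable "translation"
        self.code_cache: Dict[int, Callable[[], str]] = {}
        self.exec_counts: Dict[int, int] = {}

    def run_block(self, pc: int) -> str:
        # Translated? Yes: run the cached translation.
        cached = self.code_cache.get(pc)
        if cached is not None:
            return cached()
        # Translated? No: count the execution and ask "Hot?".
        count = self.exec_counts.get(pc, 0) + 1
        self.exec_counts[pc] = count
        if count >= HOT_THRESHOLD:
            # Hot? Yes: translate, store in the code cache, run the result.
            translation = self.translate(pc)
            self.code_cache[pc] = translation
            return translation()
        # Hot? No: fall back to the interpreter.
        return self.interpret(pc)


def demo() -> List[str]:
    d = Dispatcher(
        interpret=lambda pc: f"interpreted {pc:#x}",
        translate=lambda pc: (lambda: f"translated {pc:#x}"),
    )
    return [d.run_block(0x1000) for _ in range(4)]
```

In `demo()`, the first two executions of the block are interpreted, the third crosses the assumed threshold and triggers translation, and the fourth is served directly from the code cache.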

Translation Example

Original Code
    addl %eax,(%esp)
    addl %ebx,(%esp)
    movl %esi,(%ebp)
    subl %ecx,5

Translated Code
    ld %r30,[%esp]
    add.c %eax,%eax,%r30
    ld %r31,[%esp]
    add.c %ebx,%ebx,%r31
    ld %esi,[%ebp]
    sub.c %ecx,%ecx,5

Optimized Code
    ld %r30,[%esp]
    add %eax,%eax,%r30
    add %ebx,%ebx,%r30
    ld %esi,[%ebp]
    sub.c %ecx,%ecx,5

VLIW Code
    { ld %r30,[%esp]; sub.c %ecx,%ecx,5 }
    { ld %esi,[%ebp]; add %eax,%eax,%r30; add %ebx,%ebx,%r30 }

Software Advantages
- Moves complexity from hardware to software.
- Can optimize a large group of instructions.
- Optimization cost is amortized. Out-of-order hardware pays the cost every single time.
- Avoids legacy code problem.
- More speculation is possible with proper hardware support.

Speculation: Problem
Exceptions are precise. What if ld faults? The sub executes out-of-order.

Original Code
    addl %eax,(%esp)
    addl %ebx,(%esp)
    movl %esi,(%ebp)
    subl %ecx,5

VLIW Code
    { ld %r30,[%esp]; sub.c %ecx,%ecx,5 }
    { ld %esi,[%ebp]; add %eax,%eax,%r30; add %ebx,%ebx,%r30 }

Speculation: Solution
Commit: All registers are shadowed (working and shadow copy). Normal instructions only update the working copy. When a translation is done, a commit instruction is issued and all working registers are copied to their shadows.
Rollback: If an exception happens, a rollback instruction is issued. All shadow registers are copied to the working set and software re-executes the x86 code conservatively.
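
A minimal sketch of commit and rollback, assuming a register file modelled as two dictionaries. The class `ShadowedRegisters`, the `Fault` exception and the `load` helper are hypothetical names for illustration; the instruction sequence mirrors the two VLIW instructions from the earlier slides, and conservative re-execution of the x86 code is left out.

```python
class Fault(Exception):
    """Stands in for a hardware exception raised by a faulting load."""


class ShadowedRegisters:
    def __init__(self, **initial):
        self.working = dict(initial)   # updated by normal instructions
        self.shadow = dict(initial)    # last committed state

    def write(self, reg, value):
        self.working[reg] = value      # the shadow copy is never touched here

    def commit(self):
        self.shadow = dict(self.working)

    def rollback(self):
        self.working = dict(self.shadow)


def load(memory, addr):
    if addr not in memory:             # assumption: unmapped address = fault
        raise Fault(addr)
    return memory[addr]


def run_translation(regs, memory):
    """Run the translation; return True if it committed, False on rollback."""
    w = regs.working
    try:
        # { ld %r30,[%esp]; sub.c %ecx,%ecx,5 }
        regs.write("r30", load(memory, w["esp"]))
        regs.write("ecx", w["ecx"] - 5)
        # { ld %esi,[%ebp]; add %eax,%eax,%r30; add %ebx,%ebx,%r30 }
        regs.write("esi", load(memory, w["ebp"]))
        regs.write("eax", w["eax"] + w["r30"])
        regs.write("ebx", w["ebx"] + w["r30"])
    except Fault:
        regs.rollback()                # discard the speculative sub as well
        return False
    regs.commit()
    return True
```

If the second load faults, the early `sub` has already modified the working copy of `%ecx`, but the rollback restores the committed value, so the x86 state looks as if nothing in the translation had executed.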

Speculation: Problem
It is hard to prove that load and store addresses do not conflict.

Load Speculation: Moving loads above stores can be a big scheduling benefit.
    { st %eax,[%esp] }
    { ld %esi,[%ebp] }
    { sub.c %esi,%esi,5 }

Load/Store Elimination: Eliminate redundant loads.
    { ld %r30,[%esp] }
    { st %eax,[%esi] }
    { ld %r31,[%esp] }

Speculation: Solution
Load And Protect: Loads are converted to load-and-protect. They record the address and data size of the load and create a protected region.
Store Under Alias Mask: Stores are converted to store-under-alias-mask. They check for protected regions and raise an exception if there is an address match.
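
The two instructions can be modelled in a few lines. This is a simplified sketch under stated assumptions: the class `AliasHardware` and its method names are invented here, every store checks every protected region (the mask that selects regions is not modelled), an "address match" is treated as any byte overlap, and clearing the regions at commit time is an assumption.

```python
class AliasFault(Exception):
    """Raised when a store hits a region protected by an earlier load."""


class AliasHardware:
    def __init__(self, memory):
        self.memory = memory
        self.protected = []            # list of (address, size) regions

    def ldp(self, addr, size=4):
        """Load-and-protect: record address and size, then load."""
        self.protected.append((addr, size))
        return self.memory.get(addr, 0)

    def stam(self, addr, value, size=4):
        """Store-under-alias-mask: fault if the store overlaps a region."""
        for p_addr, p_size in self.protected:
            if addr < p_addr + p_size and p_addr < addr + size:
                raise AliasFault((addr, p_addr))
        self.memory[addr] = value

    def commit(self):
        self.protected.clear()         # assumption: regions end at commit


def eliminate_redundant_load(hw, esp, esi, eax):
    """{ ldp %r30,[%esp] } { stam %eax,[%esi] } { copy %r31,%r30 }"""
    r30 = hw.ldp(esp)
    hw.stam(esi, eax)                  # faults if [%esi] aliases [%esp]
    r31 = r30                          # copy replaces the second load
    return r30, r31
```

When the store does not alias the load, the copy is a valid replacement for the second load. When it does, the `AliasFault` gives software the chance to roll back and re-execute the x86 code without the optimization.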

Speculation

Load Speculation
  Before:
    { st %eax,[%esp] }
    { ld %esi,[%ebp] }
    { sub.c %esi,%esi,5 }
  After:
    { ldp %esi,[%ebp] }
    { stam %eax,[%esp] }
    { sub.c %esi,%esi,5 }

Load/Store Elimination
  Before:
    { ld %r30,[%esp] }
    { st %eax,[%esi] }
    { ld %r31,[%esp] }
  After:
    { ldp %r30,[%esp] }
    { stam %eax,[%esi] }
    { copy %r31,%r30 }

Self-Modifying Code: Problem
What if x86 code changes dynamically? Existing translations are probably wrong!

Self-Modifying Code: Solution
T-Bit Protection: Software write-protects the pages of x86 memory containing translations with a special T-bit. Hardware faults for writes to T-bit protected pages. Software then invalidates all translations on that page.
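
The bookkeeping behind T-bit protection can be sketched as follows. This is an illustration only: the 4 KB page size, the class `TranslationCache` and the choice to recompute the protected page set after each invalidation are assumptions, and the hardware fault is modelled as an ordinary method call.

```python
PAGE_SIZE = 4096  # assumption: 4 KB x86 pages


def pages_spanned(addr, length):
    """Page numbers touched by the byte range [addr, addr + length)."""
    return range(addr // PAGE_SIZE, (addr + length - 1) // PAGE_SIZE + 1)


class TranslationCache:
    def __init__(self):
        self.translations = {}     # x86 start address -> (length, translation)
        self.t_bit_pages = set()   # pages currently write-protected

    def add(self, x86_addr, length, translation):
        self.translations[x86_addr] = (length, translation)
        self.t_bit_pages.update(pages_spanned(x86_addr, length))

    def on_write(self, addr):
        """Model a write to x86 memory; return how many translations died."""
        page = addr // PAGE_SIZE
        if page not in self.t_bit_pages:
            return 0               # no T-bit set: the write proceeds normally
        # T-bit fault: invalidate every translation of code on that page.
        stale = [a for a, (length, _) in self.translations.items()
                 if page in pages_spanned(a, length)]
        for a in stale:
            del self.translations[a]
        # Keep the T-bit only on pages that still back a translation.
        self.t_bit_pages = {p for a, (length, _) in self.translations.items()
                            for p in pages_spanned(a, length)}
        return len(stale)
```

Page granularity keeps the check cheap, but it also means that a write to data sharing a page with code invalidates translations that were still correct.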

Self-Modifying Code
Different types of self-modifying code:
- Windows BitBlt
- Shared code and data pages
- Code that patches offsets and constants
- Just-in-time compilers: generating code in a code cache, garbage collecting code, patching code, etc.

Part II: Lessons Learned

Software Out-of-Order: Questions
- Can software speculation using commit/rollback and load-and-protect/store-under-alias-mask significantly improve performance over traditional in-order architectures?
- Can software speculation eliminate memory stalls (memory stalls are very expensive in modern CPU architectures) to compete with out-of-order architectures?

Lesson 1
Software speculation cannot compete with true out-of-order execution in terms of raw performance.

Snappiness: Questions
- What is the relationship between translation overhead and performance?
- What is the relationship between snappiness and steady-state performance?

Gears: Overview
1st Gear (Interpreter): Executes one instruction at a time. Gathers branch frequencies and direction. No startup cost, lowest speed.
2nd Gear: Initial translation. Light optimization, simple scheduling. Low translation overhead, fast execution.
3rd Gear: Better translations. Advanced optimizations. High translation overhead, fastest execution.

Gears: Costs

Gear     Startup Cost (Cycles/Instruction)   Performance   Trigger Point (# Executions)
Gear 1                                       100.0         -
Gear 2   8,000                               1.5           50
Gear 3   16,000                              1.2           10,000
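
A short worked calculation makes the trade-off in this table concrete. It rests on assumptions the slide does not spell out: the startup cost is read as a one-time cost in cycles per translated instruction, the performance column as cycles per executed instruction, Gear 1's startup cost as zero (per "No startup cost" on the overview slide), and a switch to a higher gear as paying only that gear's startup cost. The names `GEARS` and `break_even` are hypothetical.

```python
# gear -> (startup cycles per instruction, cycles per executed instruction)
GEARS = {
    1: (0.0, 100.0),      # assumption: interpreter has zero startup cost
    2: (8000.0, 1.5),
    3: (16000.0, 1.2),
}


def total_cycles(gear, executions):
    """Cycles spent on one instruction executed `executions` times in `gear`."""
    startup, cpi = GEARS[gear]
    return startup + executions * cpi


def break_even(slow, fast):
    """Executions after which the faster gear has repaid its startup cost."""
    startup_fast, cpi_fast = GEARS[fast]
    _, cpi_slow = GEARS[slow]
    return startup_fast / (cpi_slow - cpi_fast)


if __name__ == "__main__":
    print(f"Gear 1 -> 2 pays off after {break_even(1, 2):.1f} executions")
    print(f"Gear 2 -> 3 pays off after {break_even(2, 3):.1f} executions")
```

Under these assumptions, a Gear 2 translation repays itself after roughly 81 further executions, while a Gear 3 translation needs roughly 53,000 further executions to repay its cost relative to staying in Gear 2.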

[Chart slides; the figures are not included in this transcript]
Gears: CPI (Clocks per Instruction)
Application Behavior: Static Analysis
Application Behavior: Dynamic Analysis
Application Behavior: Dynamic Analysis – Cumulative
Application Behavior: Cycle Analysis
Application Behavior: Cycle Analysis – Alternative I
Application Behavior: Cycle Analysis – Alternative II

Lesson 2
The first-level gear is incredibly important for perceived performance and snappiness. The interpreter is not good enough.
Higher-level gears are incredibly important for steady-state performance.

Interrupt Latency: Questions
- When do we generate the translated code and how do we interrupt the “main x86 thread”?
- How does the design of the translator affect real-time response times or interrupt latencies?
- How does the design of the translator affect soft real-time applications?

Interrupt Latency
Transmeta’s Answer: The main “x86 thread” is interrupted and the translation is generated in-place. The main “x86 thread” then resumes.
Problem: Generating a highly optimized translation can consume millions of cycles, during which the system appears unresponsive.

Lesson 3
The design of a just-in-time compiler must be multi-threaded.
The system must guarantee a certain amount of main “x86 thread” forward progress.
The optimization thread(s) must run in the background (or on a different core) and must be preemptable.
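
One way to read this lesson is sketched below with Python's standard `threading` and `queue` modules. It is an assumption-laden illustration, not the design Transmeta shipped (their translator ran in place, as the previous slide notes): the class `BackgroundJit` and its methods are invented names, and the main thread simply keeps using the slower path until the worker has published a translation.

```python
import queue
import threading


class BackgroundJit:
    """Compiles hot blocks on a worker thread; the main thread never waits."""

    def __init__(self, translate):
        self.translate = translate          # slow optimizer: pc -> callable
        self.code_cache = {}
        self.lock = threading.Lock()        # held only for short cache access
        self.requests = queue.Queue()
        worker = threading.Thread(target=self._work, daemon=True)
        worker.start()

    def _work(self):
        while True:
            pc = self.requests.get()
            translation = self.translate(pc)   # may take a long time
            with self.lock:
                self.code_cache[pc] = translation
            self.requests.task_done()

    def request(self, pc):
        self.requests.put(pc)               # returns immediately

    def run_block(self, pc, fallback):
        with self.lock:
            translation = self.code_cache.get(pc)
        if translation is not None:
            return translation()
        return fallback(pc)                 # keep making forward progress

    def drain(self):
        self.requests.join()                # for tests: wait for the worker


def demo():
    gate = threading.Event()                # simulates a long compilation

    def slow_translate(pc):
        gate.wait()
        return lambda: f"translated {pc:#x}"

    jit = BackgroundJit(slow_translate)
    jit.request(0x1000)
    before = jit.run_block(0x1000, lambda pc: f"interpreted {pc:#x}")
    gate.set()
    jit.drain()
    after = jit.run_block(0x1000, lambda pc: f"interpreted {pc:#x}")
    return before, after
```

While the simulated compilation is blocked, `run_block` still returns through the fallback path, which is the forward-progress property the lesson asks for. Preemption of the worker is left to the operating system's thread scheduler in this sketch.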

Part III: Questions?

We are Always Looking for Talent!
SMWare
Thomas Kistler, thomas.kistler@smachines.com