Transmeta and Dynamic Code Optimization
Ashwin Bharambe, Mahim Mishra, Matthew Rosencrantz

Stuff Compilers Don’t (Can’t?) Do
Instruction reordering
Common case detection and optimization
  – Branch prediction
  – Traces (pre-fetching)
  – Optimizing traces
Why can’t compilers do these optimizations?
  – No runtime statistics
  – Legacy code (inertia to recompile)

Therefore: Dynamic Code Optimization
Optimize on the fly (at runtime)
Current processors do it to some extent
  – Instruction reordering
  – Branch prediction
You can do much better…

How Do You Implement This?
“Hardware Intensive” approach
  – Pentium Pro instruction translator: part of the critical path of the main processor
  – I-COP instruction-block optimizer: off the critical path
“Non-Hardware Intensive” approach
  – Transmeta, DAISY, Java HotSpot
Trade-offs?

I-COP (Instruction Path Coprocessors)
What?
  – Add another processor that watches the instructions retire and can perform operations on them
Why?
  – Performance!
Principles
  – Keep the optimizations out of the critical path
  – Avoid slowdown due to software

Structure
Multiple VLIW processor “slices” keep the I-COP simple while still letting it keep up with the main processor
I-COP slices have 10 special instructions for pattern matching in addition to 12 normal RISC-type instructions

Applications of I-COP
Trace cache fill
  – Find long strings of instructions that are executed frequently
Pre-fetching
  – Find a load whose result is used later as the address of another load
Instruction trace optimizations
  – Register move optimization
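The pre-fetching application above can be sketched as a scan over retired instructions. This is only an illustrative sketch, not the I-COP's actual algorithm: the `Instr` record, the register names, and the convention that a load's first source operand is its address register are all assumptions made for the example.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical retired-instruction record (not an I-COP format)."""
    op: str                 # e.g. "load" or "add"
    dest: str               # destination register
    srcs: Tuple[str, ...]   # source registers; for a load, srcs[0] is the address


def dependent_loads(trace: List[Instr]) -> List[Tuple[int, int]]:
    """Return (producer_index, consumer_index) pairs of dependent loads.

    A pair is reported when a load's destination register is later used,
    without being overwritten in between, as the address of another load.
    """
    pairs = []
    producer = {}  # register -> index of the load that last wrote it
    for i, ins in enumerate(trace):
        # Check the address register before recording this instruction's write.
        if ins.op == "load" and ins.srcs and ins.srcs[0] in producer:
            pairs.append((producer[ins.srcs[0]], i))
        if ins.op == "load":
            producer[ins.dest] = i
        else:
            # A non-load write means the register no longer holds a loaded value.
            producer.pop(ins.dest, None)
    return pairs
```

A pair found this way marks a pointer-chasing pattern, which is the kind of candidate a prefetcher would want to act on.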

The I-COP Processor
Multiple VLIW slices allow multi-level, statically scheduled, and explicitly encoded parallelism
Predication and delay slots obviate branch prediction
32 integer registers, 8 predicate registers
22 instructions: 12 RISC-type and 10 special
  – Pattern matching, bit manipulation, instrumentation
Fill buffer collects instructions for analysis
Task queue acts as a FIFO scheduler

The I-COP Processor (cont.)

Examples of Special Instructions
SearchReplace
  – Finds a given pattern, replaces it with another given pattern, and returns the number of replacements made
Subset
  – Tests whether the bits set in a given register are a subset of those set in a second register
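The two special instructions described above can be modeled in software roughly as follows. This is a sketch of the described behavior only: the function names, the use of Python integers as registers, and the assumption that SearchReplace matches whole words in a buffer are illustrative choices, not the I-COP's documented operand formats.

```python
from typing import List


def subset(a: int, b: int) -> bool:
    """True when every bit set in register a is also set in register b."""
    return (a & ~b) == 0


def search_replace(words: List[int], pattern: int, replacement: int) -> int:
    """Replace each word equal to `pattern` in place; return the replacement count.

    Assumption: exact whole-word matching over a buffer of words.
    """
    count = 0
    for i, word in enumerate(words):
        if word == pattern:
            words[i] = replacement
            count += 1
    return count
```

The subset test is a single mask operation, which suggests why it is cheap to provide as one instruction for pattern matching.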

Transmeta Crusoe
The best example of a “non-hardware-intensive” approach
New (and fast!) 128-bit VLIW processor
Aimed at systems where power efficiency is important
  – Mobile systems
  – “Dense” servers
Therefore, small gate count
BUT, need x86 compatibility
AND, at reasonable performance too

So how do they do it?
Have a “Code-Morphing” (CM) software layer that runs on the processor
All x86 software (BIOS, OS, apps) runs above this
CM software translates x86 code at runtime into the VLIW processor’s native instruction set
Also optimizes the translations!
So the processor itself can be fast and simple

Cheesy Marketing Image

Code-Morphing Software
Translates an entire basic block at once
Also does instruction reordering, branch prediction, and register renaming
The translations are stored in a translation cache (part of main memory)
Instruments code to help with branch prediction and to detect candidates for heavy optimization
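The translate-once, reuse-many-times idea behind the translation cache, together with instrumentation that flags frequently executed blocks, can be sketched as below. This is a minimal model under stated assumptions: the `CodeMorpher` class, the `translate` callback, and the `hot_threshold` parameter are hypothetical names for illustration and do not describe Transmeta's actual software.

```python
from typing import Callable, Dict, Set


class CodeMorpher:
    """Toy model of a translation cache keyed by x86 block address."""

    def __init__(self, translate: Callable[[int], str], hot_threshold: int = 3):
        self.translate = translate          # hypothetical block translator
        self.hot_threshold = hot_threshold  # assumed cutoff for "heavy optimization"
        self.cache: Dict[int, str] = {}     # block address -> native translation
        self.exec_counts: Dict[int, int] = {}
        self.hot: Set[int] = set()          # blocks flagged for further optimization

    def run_block(self, pc: int) -> str:
        # Translate only on a cache miss; later executions reuse the result.
        if pc not in self.cache:
            self.cache[pc] = self.translate(pc)
        # Instrumentation: count executions to spot optimization candidates.
        self.exec_counts[pc] = self.exec_counts.get(pc, 0) + 1
        if self.exec_counts[pc] >= self.hot_threshold:
            self.hot.add(pc)
        return self.cache[pc]
```

The cost of translation is paid once per block, so code that runs repeatedly amortizes it, and the counters give the software the runtime statistics a static compiler lacks.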

Code-Morphing Software (cont.)
Also has some help from the hardware
  – Shadowed and working register sets
  – Alias hardware (load-and-protect operations)
  – “Translated” bit for each page table entry
Performance of systems with Crusoe
  – 2-3 times longer battery life
  – Performance “comparable” to Intel mobile processors
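The shadowed and working register sets listed above support a commit-and-rollback style of execution: translated code updates the working copy, a commit makes that state official, and a rollback discards everything since the last commit. The sketch below models only that idea; the class name, register count, and method names are assumptions for illustration, not Crusoe's hardware interface.

```python
class ShadowedRegisters:
    """Toy model of working registers backed by a shadow copy."""

    def __init__(self, count: int = 8):
        self.working = [0] * count  # registers that translated code updates
        self.shadow = [0] * count   # last committed state

    def commit(self) -> None:
        # Make the current working state the new committed state.
        self.shadow = list(self.working)

    def rollback(self) -> None:
        # Discard all updates made since the last commit.
        self.working = list(self.shadow)
```

With this arrangement, aggressively reordered code can run speculatively: if something goes wrong before the next commit, the software rolls back and retries more conservatively.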