EENG449b/Savvides Lec 16.1 3/30/04 March 30, 2004 Prof. Andreas Savvides Spring 2004 EENG 449bG/CPSC 439bG Computer.

Presentation transcript:

EENG 449bG/CPSC 439bG Computer Systems
Lecture 16: Software ILP, Hardware Support for Compile-Time ILP, and Itanium Architecture
Prof. Andreas Savvides, Spring 2004
March 30, 2004
EENG449b/Savvides Lec 16.1 3/30/04

Last Time
Loop unrolling
Software pipelining
Trace scheduling: incurs cost on the less frequent paths
Trace selection: identify a sequence of basic blocks and put their operations in a smaller set of instructions
  – Can be done with loops and conditional statements for which some static branch prediction is available
  – Disadvantage: there is a single entry point and a single exit point to the trace, so overhead is high
Superblocks: single entry point but multiple exit points
  – Reduces the overhead of misprediction but may result in larger code size than trace scheduling

HW Support for Exposing ILP at Compile Time
Loop unrolling, software pipelining, trace scheduling, and superblock scheduling work well when branches can be predicted at compile time
What if branches are not predictable?
  – One solution: extend the instruction set to include predicated instructions
Predicated instructions: an instruction refers to a condition as part of instruction execution
  – Execute if the condition is true; treat the instruction as a no-op if the condition is false
  – Predication transforms control dependences into data dependences

Conditional or Predicated Instructions
Example: if (A==0) {S=T};
Assume that A, S, T are stored in R1, R2, R3
The assembly code would be:
      BNEZ  R1,L
      ADDU  R2,R3,R0
   L:
The new instruction would use a conditional move, which performs the move only if the third operand is equal to zero:
      CMOVZ R2,R3,R1
Limitation: inefficient when trying to eliminate branches that guard the execution of large blocks of code
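
The if-conversion on this slide can be sketched in C. This is only an illustration under stated assumptions: the function names are hypothetical, the registers are modeled as 64-bit integers, and the mask trick stands in for what CMOVZ does in hardware. The point is that the second form has no branch; the condition reaches the result purely as data.

```c
#include <assert.h>
#include <stdint.h>

/* Branching form: mirrors BNEZ R1,L / ADDU R2,R3,R0 / L: */
int64_t assign_with_branch(int64_t a, int64_t s, int64_t t) {
    if (a == 0) {
        s = t;
    }
    return s;
}

/* If-converted form: models CMOVZ R2,R3,R1 without any branch.
 * The mask is all ones when a == 0 and all zeros otherwise, so the
 * result depends on a only through data, not through control flow. */
int64_t assign_with_cmovz(int64_t a, int64_t s, int64_t t) {
    int64_t mask = -(int64_t)(a == 0);
    return (t & mask) | (s & ~mask);
}

/* Hypothetical self-check: both forms must agree for every condition value. */
void check_if_conversion(void) {
    for (int64_t a = -2; a <= 2; a++) {
        assert(assign_with_branch(a, 7, 11) == assign_with_cmovz(a, 7, 11));
    }
}
```

Calling check_if_conversion() exercises both forms. Note that both operands are always evaluated in the second form, which is one reason the approach becomes inefficient when the guarded block is large.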

Full Predication
The execution of all instructions is controlled by a predicate
Assume we have a 2-issue architecture

First instruction slot    Second instruction slot
LW   R10,40(R2)           ADD  R3,R4,R5
(idle)                    ADD  R6,R3,R7
BEQZ R10,L
LW   R8,0(R10)
LW   R9,0(R8)

Slide annotations: idle slot; stall; a slot is wasted since the 3rd LW is dependent on the result of the 2nd LW

Hardware Support for Exposing More Parallelism at Compile Time
Use a predicated version of load word (LWC)?
  – The load occurs unless the third operand is 0

First instruction slot    Second instruction slot
LW   R10,40(R2)           ADD  R3,R4,R5
LWC  R8,20(R10),R10       ADD  R6,R3,R7
BEQZ R10,L
LW   R9,0(R8)

If the sequence following the branch were short, the entire block of code might be converted to predicated execution and the branch eliminated

Exception Behavior Support
Several mechanisms ensure that speculation by the compiler does not violate exception behavior
  – For example, predicated code cannot raise exceptions if it is annulled
  – Prefetch does not cause exceptions

Summary #1: Hardware versus Software Speculation Mechanisms
To speculate extensively, must be able to disambiguate memory references
  – Much easier in HW than in SW for code with pointers
HW-based speculation works better when control flow is unpredictable, and when HW-based branch prediction is superior to SW-based branch prediction done at compile time
  – Mispredictions mean wasted speculation
HW-based speculation maintains a precise exception model even for speculated instructions
HW-based speculation does not require compensation or bookkeeping code

Summary #2: Hardware versus Software Speculation Mechanisms (cont’d)
Compiler-based approaches may benefit from the ability to see further in the code sequence, resulting in better code scheduling
HW-based speculation with dynamic scheduling does not require different code sequences to achieve good performance for different implementations of an architecture
  – May be the most important in the long run?

Summary #3: Software Scheduling
Instruction Level Parallelism (ILP) is found either by compiler or hardware
Loop-level parallelism is easiest to see
  – SW dependencies/compiler sophistication determine if compiler can unroll loops
  – Memory dependencies hardest to determine => memory disambiguation
  – Very sophisticated transformations available
Trace scheduling to parallelize if statements
Superscalar and VLIW: CPI < 1 (IPC > 1)
  – Dynamic issue vs. static issue
  – More instructions issue at same time => larger hazard penalty
  – Limitation is often the number of instructions that you can successfully fetch and decode per cycle

Intel/HP IA-64 “Explicitly Parallel Instruction Computer (EPIC)”
IA-64: instruction set architecture; EPIC is type
  – EPIC = 2nd generation VLIW?
Itanium™ is name of first implementation (2001)
  – Highly parallel and deeply pipelined hardware at 800 MHz
  – 6-wide, 10-stage pipeline at 800 MHz on 0.18 µ process
  – Targeted for servers and high-end computers
128 64-bit integer registers + 128 82-bit floating point registers
  – Not separate register files per functional unit as in old VLIW
Hardware checks dependencies (interlocks => binary compatibility over time)
Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?

IA-64 Registers
The integer registers are configured to help accelerate procedure calls using a register stack
  – Mechanism similar to that developed in the Berkeley RISC-I processor and used in the SPARC architecture
  – Registers 0-31 are always accessible and addressed as 0-31
  – Registers 32-127 are used as a register stack, and each procedure is allocated a set of registers (from 0 to 96)
  – The new register stack frame is created for a called procedure by renaming the registers in hardware
  – A special register called the current frame marker (CFM) points to the set of registers to be used by a given procedure
8 64-bit branch registers used to hold branch destination addresses for indirect branches
64 1-bit predicate registers

IA-64 Registers
Both the integer and floating point registers support register rotation for registers 32-127
Register rotation is designed to ease the task of allocating registers in software-pipelined loops
When combined with predication, it is possible to avoid the need for unrolling and for separate prologue and epilogue code for a software-pipelined loop
  – Makes SW pipelining usable for loops with smaller numbers of iterations, where the overheads would traditionally negate many of the advantages
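
A minimal sketch of how register rotation can be modeled in C, under stated assumptions: the names, the fixed 96-register rotating region starting at register 32, and the single rename base are simplifications for illustration, not a description of the actual hardware. Each loop iteration decrements the rename base, so a value written to virtual register X in one iteration is read as X+1 in the next, which is what lets a software-pipelined loop pass values between stages without copying.

```c
#include <assert.h>

#define ROT_BASE 32u   /* first rotating register (assumed for this sketch) */
#define ROT_SIZE 96u   /* size of the rotating region (assumed fixed here) */

typedef struct {
    unsigned rrb;      /* rename base, kept in [0, ROT_SIZE) */
} rot_state_t;

/* Map a virtual register number to a physical one. */
unsigned rot_physical(const rot_state_t *s, unsigned vreg) {
    assert(vreg < ROT_BASE + ROT_SIZE);
    if (vreg < ROT_BASE) {
        return vreg;   /* registers 0-31 are never renamed */
    }
    return ROT_BASE + (vreg - ROT_BASE + s->rrb) % ROT_SIZE;
}

/* Rotate by one position, as a loop-closing branch would. */
void rot_advance(rot_state_t *s) {
    s->rrb = (s->rrb + ROT_SIZE - 1u) % ROT_SIZE;
}
```

For example, with a fresh state, the physical register behind virtual register 32 before rot_advance is the same one behind virtual register 33 afterwards.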

Intel/HP IA-64 “Explicitly Parallel Instruction Computer (EPIC)”
Instruction group: a sequence of consecutive instructions with no register data dependences
  – All the instructions in a group could be executed in parallel, if sufficient hardware resources existed and if any dependences through memory were preserved
  – An instruction group can be arbitrarily long, but the compiler must explicitly indicate the boundary between one instruction group and another by placing a stop between 2 instructions that belong to different groups
IA-64 instructions are encoded in bundles, which are 128 bits wide
  – Each bundle consists of a 5-bit template field and 3 instructions, each 41 bits in length
3 instructions in 128-bit “groups”; field determines if instructions are dependent or independent
  – Smaller code size than old VLIW, larger than x86/RISC
  – Groups can be linked to show independence of more than 3 instructions
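
The bundle arithmetic (5 + 3 x 41 = 128 bits) can be checked with a small packing sketch in C. The function and type names are hypothetical, and the bit positions used here (template in the low 5 bits, followed by slots 0, 1, 2) are an assumption made for illustration; the slide only gives the field widths.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_BITS 41
#define SLOT_MASK ((UINT64_C(1) << SLOT_BITS) - 1)

/* A 128-bit bundle held as two 64-bit halves. */
typedef struct {
    uint64_t lo;   /* bundle bits 0-63 */
    uint64_t hi;   /* bundle bits 64-127 */
} bundle_t;

/* Assumed layout: bits 0-4 template, 5-45 slot 0, 46-86 slot 1, 87-127 slot 2. */
bundle_t bundle_pack(unsigned tmpl, uint64_t s0, uint64_t s1, uint64_t s2) {
    assert(tmpl < 32u && s0 <= SLOT_MASK && s1 <= SLOT_MASK && s2 <= SLOT_MASK);
    bundle_t b;
    b.lo = (uint64_t)tmpl | (s0 << 5) | (s1 << 46);  /* low 18 bits of slot 1 */
    b.hi = (s1 >> 18) | (s2 << 23);                  /* high 23 bits of slot 1, then slot 2 */
    return b;
}

unsigned bundle_template(bundle_t b) {
    return (unsigned)(b.lo & 0x1Fu);
}

uint64_t bundle_slot(bundle_t b, int n) {
    assert(n >= 0 && n <= 2);
    if (n == 0) return (b.lo >> 5) & SLOT_MASK;
    if (n == 1) return ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;
    return b.hi >> 23;
}
```

Packing three 41-bit values and a template and then reading them back with bundle_template and bundle_slot should return the original fields, which confirms that the four fields fill the 128 bits exactly with no padding.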

Slide from HP

Slide from HP

5 Types of Execution in Bundle

Execution unit slot   Instruction type   Description       Example instructions
I-unit                A                  Integer ALU       add, subtract, and, or, cmp
                      I                  Non-ALU int       shifts, bit tests, moves
M-unit                A                  Integer ALU       add, subtract, and, or, cmp
                      M                  Memory access     loads, stores for int/FP regs
F-unit                F                  Floating point    floating point instructions
B-unit                B                  Branches          conditional branches, calls
L+X                   L+X                Extended          extended immediates, stops

The 5-bit template field within each bundle describes both the presence of any stops associated with the bundle and the execution unit type required by each instruction within the bundle (see Fig 4.12, page 354)

Itanium™ Processor Silicon (Copyright: Intel at Hotchips ’00)
[Die photo. Labeled blocks: FPU, IA-32 Control, Instr. Fetch & Decode, Cache, TLB, Integer Units, IA-64 Control, Bus. Core processor die plus 4 x 1MB L3 cache]

Itanium™ Machine Characteristics (Copyright: Intel at Hotchips ’00)

Package                   Organic Land Grid Array
Process                   0.18u CMOS, 6 metal layer
Transistor count          25.4M CPU; 295M L3
Frequency                 800 MHz
System bus                2.1 GB/sec; 4-way glueless MP; scalable to large (512+ proc) systems
L3 cache                  4MB, 4-way s.a., BW of 12.8 GB/sec
L2/L1 cache               Dual ported 96K unified & 16KD; 16KI
L2/L1 latency             6 / 2 clocks
Virtual memory support    64 entry ITLB, 32/96 2-level DTLB, VHPT
Machine width             6 insts/clock (4 ALU/MM, 2 Ld/St, 2 FP, 3 Br)
FP compute bandwidth      3.2 GFlops (DP/EP); 6.4 GFlops (SP)
Memory -> FP bandwidth    4 DP (8 SP) operands/clock
Registers                 14 ported 128 GR & 128 FR; 64 predicates
Speculation               32 entry ALAT, exception deferral
Branch prediction         Multilevel 4-stage prediction hierarchy

Itanium™ EPIC Design Maximizes SW-HW Synergy (Copyright: Intel at Hotchips ’00)
Architecture features programmed by compiler:
  – Branch hints
  – Explicit parallelism
  – Register stack & rotation
  – Predication
  – Data & control speculation
  – Memory hints
Micro-architecture features in hardware:
  – Fetch: instruction cache & branch predictors
  – Issue: fast, simple 6-issue
  – Register handling: 128 GR & 128 FR, register remap & stack engine
  – Control: bypasses & dependencies
  – Parallel resources: 4 integer + 4 MMX units, 2 FMACs (4 for SSE), 2 LD/ST units, 32 entry ALAT
  – Speculation deferral management
  – Memory subsystem: three levels of cache (L1, L2, L3)

10 Stage In-Order Core Pipeline (Copyright: Intel at Hotchips ’00)
Front End (IPG, FET, ROT)
  – Pre-fetch/fetch of up to 6 instructions/cycle
  – Hierarchy of branch predictors
  – Decoupling buffer
Instruction Delivery (EXP, REN)
  – Dispersal of up to 6 instructions on 9 ports
  – Reg. remapping
  – Reg. stack engine
Operand Delivery (WLD, REG)
  – Reg read + bypasses
  – Register scoreboard
  – Predicated dependencies
Execution (EXE, DET, WRB)
  – 4 single cycle ALUs, 2 ld/str
  – Advanced load control
  – Predicate delivery & branch
  – NaT/exception/retirement
Stage names: IPG = instruction pointer generation, FET = fetch, ROT = rotate, EXP = expand, REN = rename, WLD = word-line decode, REG = register read, EXE = execute, DET = exception detect, WRB = write-back

Itanium processor 10-stage pipeline
Front-end (stages IPG, Fetch, and Rotate): prefetches up to 32 bytes per clock (2 bundles) into a prefetch buffer, which can hold up to 8 bundles (24 instructions)
  – Branch prediction is done using a multilevel adaptive predictor like that of the P6 microarchitecture
Instruction delivery (stages EXP and REN): distributes up to 6 instructions to the 9 functional units
  – Implements register renaming for both rotation and register stacking
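
The front-end figures on this slide follow from the 128-bit, 3-instruction bundle format given earlier in the lecture. A quick check of the arithmetic:

```latex
\begin{align*}
2\ \text{bundles} \times \frac{128\ \text{bits}}{8\ \text{bits/byte}} &= 32\ \text{bytes per clock} \\
2\ \text{bundles} \times 3\ \text{instructions/bundle} &= 6\ \text{instructions per clock} \\
8\ \text{bundles} \times 3\ \text{instructions/bundle} &= 24\ \text{instructions buffered}
\end{align*}
```

The second line matches the 6-instruction dispersal width of the instruction delivery stages.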

Itanium processor 10-stage pipeline
Operand delivery (WLD and REG): accesses the register file, performs register bypassing, accesses and updates a register scoreboard, and checks predicate dependences
  – Scoreboard used to detect when individual instructions can proceed, so that a stall of 1 instruction in a bundle need not cause the entire bundle to stall
Execution (EXE, DET, and WRB): executes instructions through ALUs and load/store units, detects exceptions and posts NaTs, retires instructions and performs write-back

Slide from HP

Slide from HP

Slide from HP

Slide from HP

Comments on Itanium
Remarkably, the Itanium has many of the features more commonly associated with dynamically scheduled pipelines
  – Strong emphasis on branch prediction, register renaming, scoreboarding, a deep pipeline with many stages before execution (to handle instruction alignment, renaming, etc.), and several stages following execution to handle exception detection
Surprising that an approach whose goal is to rely on compiler technology and simpler HW seems to be at least as complex as dynamically scheduled processors!

Performance of IA-64 Itanium?
Despite the existence of silicon, no significant standard benchmark results are available for the Itanium
Whether this approach will result in significantly higher performance than other recent processors is unclear
The clock rate of Itanium (733 MHz) is competitive but slower than the clock rates of several dynamically scheduled machines that are already available, including the Pentium III, Pentium 4 and AMD Athlon

Itanium Performance: SPECint

Itanium Performance: SPECfp

Itanium Today & Tomorrow

VLIW in Embedded Designs
VLIW: greater parallelism under programmer and compiler control vs. hardware control in superscalar
Used in DSPs and multimedia processors as well as IA-64
What about code size?
Effectiveness and quality of compilers for these applications?

Example VLIW for multimedia: Philips Trimedia CPU
Every instruction contains 5 operations
Predicated with single register value; if 0 => all 5 operations are canceled
128 32-bit registers, which contain either integer or floating point data
Partitioned ALU (SIMD) instructions to compute on multiple instances of narrow data
Offers both saturating arithmetic (DSPs) and 2’s complement arithmetic (desktop)
Delayed branch with 3 branch delay slots

Trimedia Operations
Large number of ops because the designers used retargetable compilers, multiple machine descriptions, and die size estimators to explore the space and find the best cost-performance design
  – Verification time, manufacturing test, design time?

Trimedia Functional Units, Latency, Instruction Slots
23 functional units of 11 types; the table shows which of the 5 slots each can issue in (and hence the number of functional units)

Philips Trimedia CPU
Compiler responsible for including no-ops
  – Both within an instruction (when an operation field cannot be used) and between dependent instructions
  – Processor does not detect hazards, which if present will lead to incorrect execution
Code size? Compresses the code (~ Quiz #1)
  – Decompresses after the code is fetched from the instruction cache

Example
Using MIPS notation, look at code for:
void sum (int a[], int b[], int c[], int n)
{
    int i;
    for (i=0; i<n; i++)
        c[i] = a[i]+b[i];
}

Example: MIPS code for loop
Loop: LD     R11,0(R4)      # R11 = a[i]
      LD     R12,0(R5)      # R12 = b[i]
      DADDU  R17,R11,R12    # R17 = a[i]+b[i]
      SD     R17,0(R6)      # c[i] = a[i]+b[i]
      DADDIU R4,R4,8        # R4 = next a[] addr
      DADDIU R5,R5,8        # R5 = next b[] addr
      DADDIU R6,R6,8        # R6 = next c[] addr
      BNE    R4,R7,Loop     # if not last go to Loop
Then unroll 4 times and schedule
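
What "unroll 4 times and schedule" might look like at the source level can be sketched in C. This is an illustration, not the slide's scheduled code: the function name is hypothetical, the cleanup loop for leftover iterations is an addition, and the restrict qualifiers encode an assumption that the three arrays do not overlap. That assumption is what allows the four loads and adds to be grouped ahead of the four stores.

```c
#include <assert.h>

/* Sum of two arrays, unrolled by 4. Assumes a, b and c do not overlap
 * (restrict), so loads may be scheduled ahead of earlier stores. */
void sum_unrolled4(const int *restrict a, const int *restrict b,
                   int *restrict c, int n) {
    assert(n >= 0);
    int i = 0;
    /* Main body: four independent adds per trip, one loop branch. */
    for (; n - i >= 4; i += 4) {
        int t0 = a[i]     + b[i];
        int t1 = a[i + 1] + b[i + 1];
        int t2 = a[i + 2] + b[i + 2];
        int t3 = a[i + 3] + b[i + 3];
        c[i]     = t0;
        c[i + 1] = t1;
        c[i + 2] = t2;
        c[i + 3] = t3;
    }
    /* Cleanup for the 0-3 leftover elements when n is not a multiple of 4. */
    for (; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

The four temporaries are independent of each other, which is what gives a VLIW compiler operations to pack into the same instruction; the loop overhead (pointer updates and branch) is paid once per four elements instead of once per element.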

Trimedia Version
Loop address in register 30
Conditional jump (JMPF) so that only the jump is conditional, not the whole instruction predicated
DADDUI (1st slot, 2nd instr) and SETEQ (1st slot, 3rd instr) compute the loop termination test
  – Duplicate last add early enough to schedule the 3-instruction branch delay
24/40 slots used (60%) in this example

Clock cycles to execute 2D iDCT
Note that the Trimedia results are based on compilation, unlike many of the others
The year 2000 clock rate of the CPU64 is 300 MHz
The 1999 clock rates of the others are about 400 MHz for the PowerPC, PA-8000, and Pentium II, with the TM at 100 MHz and the TI x at 200 MHz

Transmeta Crusoe MPU
80x86 instruction set compatibility through a software system that translates from the x86 instruction set to the VLIW instruction set implemented by Crusoe
VLIW processor designed for the low-power marketplace
Typical applications
  – Notebook: Sony, others
  – Compact servers: RLX Technologies

Crusoe processor: Basics
VLIW with in-order execution
64 integer registers
32 floating point registers
Simple in-order, 6-stage integer pipeline: 2 fetch stages, 1 decode, 1 register read, 1 execution, and 1 register write-back
10-stage pipeline for floating point, which has 4 extra execute stages
Instructions in 2 sizes: 64 bits (2 ops) and 128 bits (4 ops)

Crusoe processor: Operations
5 different types of operation slots:
  – ALU operations: typical RISC ALU operations
  – Compute: this slot may specify any integer ALU operation (2 integer ALUs), a floating point operation, or a multimedia operation
  – Memory: a load or store operation
  – Branch: a branch instruction
  – Immediate: a 32-bit immediate used by another operation in this instruction
For 128-bit instructions: the first 3 fields are Memory, Compute, ALU; the last field is either Branch or Immediate

80x86 Compatibility
Initially, and for lowest latency to start execution, the x86 code can be interpreted on an instruction-by-instruction basis
If a code segment is executed several times, it is translated into an equivalent Crusoe code sequence, and the translation is cached
  – The unit of translation is at least a basic block, since we know that if any instruction in the block is executed, they will all be executed
  – Translating an entire block both improves the translated code quality and reduces the translation overhead, since the translator need only be called once per basic block
Assumes 16MB of main memory for the cache
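
The interpret-first, translate-when-hot policy on this slide can be sketched as a small translation cache in C. Everything specific here is an assumption for illustration: the names, the direct-mapped table, the 64-entry size, and the threshold of 3 executions are hypothetical, and the sketch only tracks which mode a block runs in rather than performing any translation.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_SLOTS   64   /* hypothetical table size */
#define HOT_THRESHOLD 3    /* hypothetical: translate after 3 interpreted runs */

typedef enum { RAN_INTERPRETED, RAN_TRANSLATED } run_mode_t;

typedef struct {
    uint32_t block_addr;   /* x86 address of the basic block */
    unsigned exec_count;   /* interpreted executions so far */
    int      valid;
    int      translated;   /* a cached translation exists */
} block_entry_t;

typedef struct {
    block_entry_t slots[CACHE_SLOTS];
    unsigned      translations;   /* how many times the translator was invoked */
} tcache_t;

/* Execute one basic block and report how it was run. */
run_mode_t run_block(tcache_t *tc, uint32_t block_addr) {
    assert(tc != 0);
    block_entry_t *e = &tc->slots[block_addr % CACHE_SLOTS];
    if (!e->valid || e->block_addr != block_addr) {
        /* First sighting (or a conflicting entry): start profiling afresh. */
        e->block_addr = block_addr;
        e->exec_count = 0;
        e->valid = 1;
        e->translated = 0;
    }
    if (e->translated) {
        return RAN_TRANSLATED;    /* reuse the cached translation */
    }
    e->exec_count++;              /* interpret and count this execution */
    if (e->exec_count >= HOT_THRESHOLD) {
        e->translated = 1;        /* translator is called once per block */
        tc->translations++;
    }
    return RAN_INTERPRETED;
}
```

With a zero-initialized tcache_t, the first three calls for a given block address run interpreted and the fourth runs from the cached translation, while the translations counter stays at 1 no matter how many more times the block executes.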

Exception Behavior during Speculation
Crusoe support for speculative reordering consists of 4 major parts:
1. Shadowed register file
  – Shadow discarded only when x86 instruction has no exception
2. Program-controlled store buffer
  – Only store when no exception; keep until OK to store
3. Memory alias detection hardware with speculative loads
4. Conditional move instruction (called select) that is used to do if-conversion on x86 code sequences
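
Parts 1 and 2 of the list above can be modeled with a commit/rollback sketch in C. This is a simplified illustration under stated assumptions: the names, the register and memory sizes, and the whole-file copy on commit are hypothetical, and the slide does not describe how the hardware actually implements these steps.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_REGS    8    /* hypothetical sizes for the sketch */
#define MEM_WORDS   16
#define MAX_PENDING 4

typedef struct {
    int64_t working[NUM_REGS];   /* registers updated speculatively */
    int64_t shadow[NUM_REGS];    /* last committed register state */
    int64_t memory[MEM_WORDS];   /* architecturally visible memory */
    struct { unsigned addr; int64_t value; } pending[MAX_PENDING];
    int npending;                /* stores held back in the buffer */
} spec_state_t;

/* Speculative store: buffered, not yet visible in memory. */
void spec_store(spec_state_t *s, unsigned addr, int64_t value) {
    assert(addr < MEM_WORDS && s->npending < MAX_PENDING);
    s->pending[s->npending].addr = addr;
    s->pending[s->npending].value = value;
    s->npending++;
}

/* No exception occurred: the old shadow is replaced and stores drain in order. */
void spec_commit(spec_state_t *s) {
    memcpy(s->shadow, s->working, sizeof s->shadow);
    for (int i = 0; i < s->npending; i++) {
        s->memory[s->pending[i].addr] = s->pending[i].value;
    }
    s->npending = 0;
}

/* Exception occurred: restore registers and drop the buffered stores. */
void spec_rollback(spec_state_t *s) {
    memcpy(s->working, s->shadow, sizeof s->working);
    s->npending = 0;
}
```

After a rollback the state looks as if the speculated instructions never ran, so the x86 code can be re-executed conservatively to deliver a precise exception.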

Crusoe Performance?
Because Crusoe depends on realistic behavior to tune the code translation process, it will not perform predictably when benchmarked using simple but unrealistic scripts
  – Needs idle time to translate
  – Profiling to find hot spots
To remedy this factor, Transmeta has proposed a new set of benchmark scripts
  – Unfortunately, these scripts have not been released and endorsed by either a group of vendors or an independent entity

Real Time, so comparison is Energy

Next Time
Memory Hierarchies: Chapter 5
Homework 2 due April 20th
Midterm 2 April 25
Both homework 2 and midterm 2 cover class material from Chapters 3, 4 and 5
Project presentations during finals week or reading week…