Hardware Support for Compiler Speculation


Hardware Support for Compiler Speculation
- Compiler needs to move instructions before branch, possibly before condition
- Requirements:
  - Instructions that can be moved without disrupting data flow
  - Exceptions that can be ignored until outcome is known
  - Ability to speculatively access memory with potential address conflicts

Exception Support
- Four methods:
  - Hardware and OS cooperate to ignore exceptions for speculative instructions
  - Speculative instructions never raise exceptions; explicit checks must be made
  - Poison bits are used to mark registers with invalid results; use of a poisoned register causes an exception
  - Speculative results are buffered until it is certain that the instruction is no longer speculative
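The poison-bit method can be sketched in C as follows. This is a minimal model under stated assumptions, not a description of any real hardware: the names `reg_t`, `spec_load`, `spec_add` and `use_reg` are hypothetical, and a NULL pointer stands in for any address that would fault.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical register model: a value plus one poison bit. */
typedef struct {
    long value;
    bool poison;
} reg_t;

/* Speculative load. Assumption: a NULL pointer stands in for any
   faulting address. Instead of raising the exception, the
   destination register is marked as poisoned. */
reg_t spec_load(const long *addr) {
    reg_t r = {0, false};
    if (addr == NULL) {
        r.poison = true;
    } else {
        r.value = *addr;
    }
    return r;
}

/* Speculative add: poison propagates from either source to the result. */
reg_t spec_add(reg_t a, reg_t b) {
    reg_t r = { a.value + b.value, a.poison || b.poison };
    return r;
}

/* Nonspeculative use: returns false if the register is poisoned (this
   is where the deferred exception would be raised); otherwise stores
   the value in *out and returns true. */
bool use_reg(reg_t r, long *out) {
    assert(out != NULL);
    if (r.poison) {
        return false;
    }
    *out = r.value;
    return true;
}
```

In this model a faulting speculative load costs nothing unless its result is actually consumed by a nonspeculative instruction.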

Exception Handling
- Nonterminating exceptions (e.g. page fault) can be handled normally
  - May cause serious performance loss if the speculated instruction turns out not to be needed

Memory Reference Speculation
- Moving loads across stores is only safe if the addresses do not conflict
- Special instructions check for address conflicts
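One way to picture the address check is a small table of speculatively loaded addresses that every store consults. The sketch below is illustrative only: `load_table_t`, `advanced_load`, `checked_store`, `check_load` and the table size are assumed names and values, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 4  /* hypothetical number of tracked loads */

/* Addresses of loads that were moved above stores; NULL = invalid. */
typedef struct {
    const long *addr[TABLE_SIZE];
} load_table_t;

/* Speculative load: read the value early and remember the address
   in the given table slot. */
long advanced_load(load_table_t *t, int slot, const long *addr) {
    assert(slot >= 0 && slot < TABLE_SIZE);
    t->addr[slot] = addr;
    return *addr;
}

/* Every store checks its address against the table and invalidates
   any entry that conflicts, then performs the write. */
void checked_store(load_table_t *t, long *addr, long value) {
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (t->addr[i] == addr) {
            t->addr[i] = NULL;
        }
    }
    *addr = value;
}

/* Check placed at the load's original position: if the entry is still
   valid the early value is used, otherwise the load is repeated. */
long check_load(const load_table_t *t, int slot, const long *addr,
                long speculative_value) {
    assert(slot >= 0 && slot < TABLE_SIZE);
    if (t->addr[slot] == addr) {
        return speculative_value;  /* no conflicting store */
    }
    return *addr;                  /* conflict: reload */
}
```

The common case (no conflict) pays only for the check; the reload happens only when a store really did hit the same address.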

4.6. Crosscutting Issues: Hardware vs. Software Speculation
- A number of trade-offs and limitations:
  - Disambiguating memory references is hard for a compiler
  - Hardware branch prediction is usually better
  - Precise exceptions easier in hardware
  - Hardware does not require “housekeeping” code
  - Compilers can “look” further
  - Hardware techniques are more portable

Hardware/Software Speculation
- Major disadvantage of hardware: complexity!
- Some architectures combine hardware and software approaches

4.7. Putting It All Together: IA-64 and Itanium
- RISC-style
- Register-register
- Emphasis on software-based optimisations
- Features:
  - 128 × 65-bit integer registers
  - 128 × 82-bit FP registers
  - 64 predicate registers; 8 branch registers

Registers
- Integer registers
  - Use windowing mechanism
  - 0–31 always visible
  - Remainder arranged in overlapping windows
    - Local and out areas (variable size)
    - Hardware for over-/underflow
- Int and FP registers support register rotation
  - Supports software pipelining

Instruction Format and VLIW
- Compiler schedules parallel instructions; flags dependences
- Instruction group
  - Sequence of (register) independent instructions
  - Compiler marks boundaries between groups (stop)
- Bundle
  - 128 bits: 5-bit template + 3 × 41-bit instructions
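The bundle layout (5 + 3 × 41 = 128 bits) can be illustrated with a small pack/unpack sketch in C. The field order used here (template in the lowest 5 bits, followed by slots 0, 1 and 2) follows the IA-64 encoding as I understand it and is not stated on the slide; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_MASK ((UINT64_C(1) << 41) - 1)  /* low 41 bits set */

/* A 128-bit bundle held as two 64-bit halves (lo = bits 0..63). */
typedef struct { uint64_t lo, hi; } bundle_t;

typedef struct {
    unsigned template_bits;  /* 5-bit template */
    uint64_t slot[3];        /* three 41-bit instruction slots */
} decoded_t;

/* Assumed layout: template = bits 0..4, slot 0 = bits 5..45,
   slot 1 = bits 46..86, slot 2 = bits 87..127. */
bundle_t pack_bundle(unsigned tmpl, uint64_t s0, uint64_t s1, uint64_t s2) {
    bundle_t b;
    assert(tmpl < 32);
    s0 &= SLOT_MASK;
    s1 &= SLOT_MASK;
    s2 &= SLOT_MASK;
    b.lo = (uint64_t)tmpl | (s0 << 5) | (s1 << 46);  /* low 18 bits of slot 1 */
    b.hi = (s1 >> 18) | (s2 << 23);                  /* high 23 bits of slot 1 */
    return b;
}

decoded_t unpack_bundle(bundle_t b) {
    decoded_t d;
    d.template_bits = (unsigned)(b.lo & 0x1F);
    d.slot[0] = (b.lo >> 5) & SLOT_MASK;
    d.slot[1] = ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;
    d.slot[2] = (b.hi >> 23) & SLOT_MASK;
    return d;
}
```

Slot 1 straddles the two 64-bit halves, which is why its decoding combines bits from both.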

Instruction Bundle
- Template specifies stops and execution unit:
  - I-unit (int + special: multimedia, etc.)
  - M-unit (int + memory access)
  - F-unit (FP)
  - B-unit (branches)
  - L+X (extended instructions)

Example
    for (int k = 0; k < 1000; k++) { x[k] = x[k] + s; }
- Unrolled seven times
- Optimised for size:
  - 9 bundles; 15% nops
  - 21 cycles (3 per calculation)
- Optimised for performance:
  - 11 bundles; 30% nops
  - 12 cycles (1.7 per calculation)
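For reference, this is roughly what unrolling the loop seven times looks like at source level. It is a sketch only: the element type `double` and the function names are assumptions, and the slide's figures refer to the compiler's IA-64 schedule, not to this C code. Since 1000 = 7 × 142 + 6, a cleanup loop handles the six leftover iterations. The per-calculation figures on the slide are the cycle counts divided by the seven results per unrolled iteration (21 / 7 = 3 and 12 / 7 ≈ 1.7).

```c
#include <stddef.h>

/* Original loop from the slide, generalised to n elements. */
void add_scalar(double *x, size_t n, double s) {
    for (size_t k = 0; k < n; k++) {
        x[k] = x[k] + s;
    }
}

/* The same loop unrolled seven times. The seven statements in the
   body are independent of each other, which is what gives the
   compiler room to fill bundles. */
void add_scalar_unrolled7(double *x, size_t n, double s) {
    size_t k = 0;
    for (; k + 7 <= n; k += 7) {
        x[k]     = x[k]     + s;
        x[k + 1] = x[k + 1] + s;
        x[k + 2] = x[k + 2] + s;
        x[k + 3] = x[k + 3] + s;
        x[k + 4] = x[k + 4] + s;
        x[k + 5] = x[k + 5] + s;
        x[k + 6] = x[k + 6] + s;
    }
    /* Cleanup loop for the remaining n % 7 elements. */
    for (; k < n; k++) {
        x[k] = x[k] + s;
    }
}
```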

Instructions
- 41 bits long
  - 4-bit opcode (+ template bits)
  - 6-bit predicate register specifier
- Predication
  - Almost all instructions can be predicated
  - Branch is jump with predicate check!
  - Complex comparisons set two predicate registers
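Predication can be modelled in C as follows: one comparison sets a predicate and its complement, and each later operation takes effect only if its predicate is set. This is a conceptual sketch with hypothetical names (`pred_pair_t`, `cmp_lt`, `pred_mov`); a C compiler is free to implement the conditional expression with a branch, so the code shows the idea and not the generated machine code.

```c
#include <stdbool.h>

/* Hypothetical model: one compare sets two predicate registers,
   the condition and its complement. */
typedef struct {
    bool p_true;
    bool p_false;
} pred_pair_t;

pred_pair_t cmp_lt(long a, long b) {
    pred_pair_t p = { a < b, !(a < b) };
    return p;
}

/* Predicated move: the new value is written only if the predicate
   is set; otherwise the destination keeps its old value. */
long pred_mov(bool pred, long old_value, long new_value) {
    return pred ? new_value : old_value;
}

/* If-converted form of:
       if (a < b) r = b - a; else r = a - b;
   Both arms are issued; the predicates select which one takes effect. */
long abs_diff(long a, long b) {
    pred_pair_t p = cmp_lt(a, b);
    long r = 0;
    r = pred_mov(p.p_true,  r, b - a);
    r = pred_mov(p.p_false, r, a - b);
    return r;
}
```

For example, `abs_diff(3, 10)` and `abs_diff(10, 3)` both evaluate to 7 without any control-flow decision between the two arms.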

Speculation
- Control speculation: exceptions can be deferred
  - Speculative loads (ld.s)
  - Uses poison bits (65-bit registers)
  - Nonspeculative and chk instructions raise the exception
- Data speculation: loads moved above stores
  - Called advanced load (ld.a)
  - Stores check addresses

Itanium
- First implementation of IA-64
- Issues up to six instructions per cycle (two bundles)
- Nine functional units
  - 2 × I, 2 × M, 3 × B, 2 × F
- 10-stage pipeline
- Multilevel dynamic branch predictor

Itanium
- Complex hardware with many features of dynamically scheduled pipelines!
  - Branch prediction
  - Register renaming
  - Scoreboarding
  - Deep pipeline
  - etc.

Itanium: Performance
- SPECint not too impressive
  - 85% of Alpha 21264 (older, more power-efficient processor!)
- FP better
  - Faster, even with slower clock!
  - But skewed by one benchmark for Pentium
  - Alpha compilers need improvement

4.8. Another View: ILP in Embedded Processors
- Trimedia (see chapter 2)
  - “Classic” VLIW
  - Hardware decompression of code
- Crusoe
  - Software translation of 80x86 to VLIW
  - Low power

Trimedia TM32 Architecture
- VLIW
  - Instruction specifies five operations
- Static scheduling
  - No hardware hazard detection
- 23 functional units (11 types)

Transmeta Crusoe
- Low power design
- Emulates 80x86
- VLIW
  - 64-bit (2 op) and 128-bit (4 op) instructions
- Five types of operations:
  - ALU (int, register-register)
  - Compute (int ALU, FP, multimedia)
  - Memory
  - Branch
  - Immediate

Crusoe
- Simple, in-order pipeline
  - Integer: 6-stage (IF1, IF2, DEC, OP, EX, WB)
  - FP: 10-stage (5 EX stages)

Crusoe
- Software interpretation of 80x86 code:
  - Basic blocks cached
- Exception handling complicated
  - Crusoe has good support for speculative reordering
  - Memory writes buffered and committed only when safe
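The idea of buffering memory writes and committing them only when it is safe can be sketched as a small store buffer in C. This is a conceptual model, not Transmeta's actual mechanism: `store_buffer_t`, `buffered_store`, `commit`, `rollback` and the capacity are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 8  /* hypothetical buffer capacity */

/* Writes that have been issued speculatively but not yet applied. */
typedef struct {
    long  *addr[MAX_PENDING];
    long   value[MAX_PENDING];
    size_t count;
} store_buffer_t;

/* Speculative store: record the write instead of changing memory. */
void buffered_store(store_buffer_t *sb, long *addr, long value) {
    assert(sb->count < MAX_PENDING);
    sb->addr[sb->count] = addr;
    sb->value[sb->count] = value;
    sb->count++;
}

/* Commit: the translated block completed without an exception, so
   the buffered writes are applied to memory in program order. */
void commit(store_buffer_t *sb) {
    for (size_t i = 0; i < sb->count; i++) {
        *sb->addr[i] = sb->value[i];
    }
    sb->count = 0;
}

/* Rollback: an exception occurred, so the buffered writes are
   discarded and memory is left unchanged. */
void rollback(store_buffer_t *sb) {
    sb->count = 0;
}
```

After a rollback the original code can be re-executed in order, so that the exception appears at the correct point.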

Crusoe Performance
- Hard to measure accurately
- Power consumption is low (⅓ of Pentium)

4.9. Fallacies and Pitfalls
- Fallacy: There is a simple approach to multiple-issue (high performance with low complexity)
  - Big gap between peak and sustained performance for multiple-issue processors
  - Need dynamic scheduling, speculation support, branch prediction, sophisticated prefetch, etc.
  - Sophisticated compilers are required

4.10. Concluding Comments
- “Hardware” techniques migrating to “software” and vice versa
- Multiprocessors may be important in future

Chapter 5 Memory Hierarchy Design

Memory Hierarchies
- Not a new idea!
- Takes advantage of the principle of locality
  - Temporal
  - Spatial
- Small, fast memories close to processor

Memory Hierarchies
[Figure: levels of the hierarchy, from top to bottom: Registers, Cache, Memory, I/O Devices (virtual memory); arrows labelled Speed, Size and Cost]

Introduction
- Usually includes responsibility for memory protection
- Performance is a major problem

Figure 5.2

Characterising Levels of the Memory Hierarchy
- Four questions:
  - Where can a block be placed? (placement)
  - How is a block found? (identification)
  - Which block should be replaced on a miss? (replacement)
  - What happens on a write? (write strategy)

Example
- The Alpha 21264 is used as an example throughout

Caches
- Where is a block placed in a cache?
- Three possible answers → three different types
  - Anywhere: fully associative
  - Only into one block: direct mapped
  - Into subset of blocks: set associative

Cache Categories
- Three categories: set associative, direct-mapped, fully associative
- n-way set associative, where n is the number of blocks in a set
  - Commonly, n = 2 or n = 4
- Direct-mapped: “1-way set associative”
- Fully associative: “m-way set associative” (m is the total number of blocks in the cache)
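Viewing all three categories as n-way set associative means one placement rule covers them all. The sketch below uses the usual modulo mapping (set = block address mod number of sets), which is an assumption here as it is not stated on these slides; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Set that a memory block maps to in a cache with `num_blocks` blocks
   and associativity `n` (blocks per set).
   n == 1          : direct-mapped (one candidate block per address)
   n == num_blocks : fully associative (a single set, index always 0) */
size_t cache_set_index(size_t block_addr, size_t num_blocks, size_t n) {
    assert(n > 0 && num_blocks % n == 0);
    size_t num_sets = num_blocks / n;
    return block_addr % num_sets;
}
```

With 8 blocks, block address 12 maps to block 4 when direct-mapped (12 mod 8), to set 0 when 2-way set associative (12 mod 4), and to the single set 0 when fully associative.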