The ESW (Expandable Split Window) Paradigm. Manoj Franklin & Gurindar S. Sohi. 05/10/2002.

Observations
- There is a large amount of exploitable ILP, in theory
- Instructions close together tend to be dependent; parallelism is available further downstream
- Centralized resources are bad (they do not scale)
- Minimizing communication cost is important

What about others?
- Dataflow model
  + the most general
  - unconventional programming-language paradigm
  - communication cost can be high
- Superscalar, VLIW (sequential)
  + exploit temporal locality
  - large centralized hardware
  - the compiler is too dumb to find enough parallelism statically
  - not scalable
- ESW = dataflow + sequential: the parallelism of the first with the conventional interface of the second

Design Goals
- Decentralized resources
- Minimize wasted (mis-speculated) execution
- Speculative memory-address disambiguation
- Realizability
- Replace one large dynamic window with many small ones

How it works
- Basic window (formation sketched below)
  – a single-entry, loop-free, call-free block
  – may be equal to, a superset of, or a subset of a basic block
- Basic windows execute in parallel
- Multiple independent stages
  – each complete with branch prediction, an L1 cache, a register file, etc.
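Below is a minimal Python sketch of how a stream might be cut into basic windows under these rules. The instruction format (opcode/target tuples), the cut conditions, and the function name are illustrative assumptions, not the paper's actual algorithm; the 32-instruction cap comes from the simulation environment slide later on.

```python
MAX_WINDOW = 32  # cap per basic window, per the simulation environment slide

def split_into_windows(instructions):
    """Cut an instruction stream into basic windows: single-entry,
    loop-free, call-free regions of at most MAX_WINDOW instructions.
    Forward branches may stay inside a window, so a window can span
    several basic blocks."""
    windows, current = [], []
    for pc, (opcode, target) in enumerate(instructions):
        current.append(pc)
        backward = opcode == "branch" and target is not None and target <= pc
        if opcode == "call" or backward or len(current) == MAX_WINDOW:
            windows.append(current)  # calls and loop back-edges end a window
            current = []
    if current:
        windows.append(current)
    return windows

# Example: the backward branch at pc 2 closes the first window.
prog = [("add", None), ("load", None), ("branch", 0), ("store", None)]
print(split_into_windows(prog))  # -> [[0, 1, 2], [3]]
```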

Distributed Instruction Supply
Optimization: snooping on L2-to-L1 cache traffic (see the sketch below)
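One plausible reading of this optimization, sketched below: all stages share the L2, so when any stage's L1 instruction cache fills a line, the other stages' L1s snoop the fill off the shared bus and keep a copy, avoiding their own later misses. The class and bus structure here are illustrative, not the paper's design.

```python
class SnoopingICache:
    def __init__(self, bus):
        self.lines = {}
        bus.append(self)  # every stage's L1 sits on the shared L2 bus

    def fetch(self, line_addr, l2, bus):
        if line_addr not in self.lines:
            data = l2[line_addr]                # miss: fill from the L2 ...
            for cache in bus:                   # ... and every L1 on the bus
                cache.lines[line_addr] = data   # snoops up the fill
        return self.lines[line_addr]

l2 = {0x80: "basic-window code"}
bus = []
stage0, stage1 = SnoopingICache(bus), SnoopingICache(bus)
stage0.fetch(0x80, l2, bus)
print(0x80 in stage1.lines)  # -> True: stage 1 got the line without a miss
```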

Distributed Inter-Instruction Communication
Observations:
1. Most register values are used within the basic block that produces them
2. The rest are used in the immediately following blocks
Architecture:
1. a distributed future file per stage
2. create/use masks for register dependence checking (sketched below)
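A hedged sketch of the mask mechanism: each window summarizes which registers it creates (writes) and which it uses from outside, as bitmasks; a stage must wait for any register that an older, still-active stage will still create. The helper names and exact policy below are illustrative assumptions.

```python
def make_masks(window_insts):
    """window_insts: list of (dest_reg or None, [src_regs]) tuples.
    Returns (create_mask, use_mask); bit i stands for register i."""
    create, use = 0, 0
    for dest, srcs in window_insts:
        for s in srcs:
            if not (create >> s) & 1:  # only externally produced values count
                use |= 1 << s
        if dest is not None:
            create |= 1 << dest
    return create, use

def must_wait(my_use_mask, older_create_masks):
    """Registers this stage reads that an older, still-active stage will
    write: nonzero bits mean 'wait for the value via the future file'."""
    pending = 0
    for m in older_create_masks:
        pending |= m
    return my_use_mask & pending

c0, u0 = make_masks([(1, [2, 3]), (4, [1])])  # writes r1, r4; reads r2, r3
c1, u1 = make_masks([(5, [1, 6])])            # reads r1 (made by stage 0)
print(bin(must_wait(u1, [c0])))               # -> 0b10: wait for r1
```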

Distributed Data-Memory System
Problems:
1. The address space is too large to build create/use masks for it
2. Consistency must be maintained between multiple speculative copies
Solution: the Address Resolution Buffer (ARB)

ARB
Q: What happens when the ARB is full?
- Load/store bits are cleared upon commit
- Stages are restarted when a memory dependence violation is detected
- On a load, the value is forwarded from the ARB if a matching store entry already exists
(these mechanics are sketched below)
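A minimal sketch of those three behaviors, assuming stages are numbered in speculation order (lower = older). The entry layout and method names are illustrative, not the paper's exact ARB organization.

```python
class ARB:
    def __init__(self):
        # address -> {stage: {"loaded": bool, "store_val": value or None}}
        self.entries = {}

    def _slot(self, addr, stage):
        return self.entries.setdefault(addr, {}).setdefault(
            stage, {"loaded": False, "store_val": None})

    def load(self, addr, stage, memory):
        self._slot(addr, stage)["loaded"] = True
        # Forward from the closest older (or same) stage that stored here.
        for s in sorted(self.entries[addr], reverse=True):
            if s <= stage and self.entries[addr][s]["store_val"] is not None:
                return self.entries[addr][s]["store_val"]
        return memory[addr]  # no speculative store yet: read memory

    def store(self, addr, stage, value):
        self._slot(addr, stage)["store_val"] = value
        # Any younger stage that already loaded this address violated a
        # memory dependence and must be squashed and restarted.
        return [s for s, e in self.entries[addr].items()
                if s > stage and e["loaded"]]

    def commit(self, stage, memory):
        # The oldest stage commits: write its stores through, clear its bits.
        for addr in list(self.entries):
            entry = self.entries[addr].pop(stage, None)
            if entry and entry["store_val"] is not None:
                memory[addr] = entry["store_val"]
            if not self.entries[addr]:
                del self.entries[addr]

mem = {100: 7}
arb = ARB()
print(arb.load(100, 2, mem))  # -> 7 (no store seen yet, value from memory)
print(arb.store(100, 1, 9))   # -> [2]: stage 2 loaded too early, restart it
arb.commit(1, mem)
print(mem[100])               # -> 9 (committed store written to memory)
```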

Simulation Environment
- Custom simulator using the MIPS R2000 pipeline
- Up to 2 instructions fetched/decoded/issued per cycle per IE (execution unit)
- Up to 32 instructions per basic window
- 4K-word L1 cache; 64 KB L2 data cache (100% hit rate, what??)
- 3-bit counter branch prediction (sketched below)
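For concreteness, a sketch of a 3-bit saturating-counter predictor like the one listed above; the table size and PC indexing are assumptions, not details from the paper.

```python
class CounterPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [4] * entries  # 3-bit counters, start weakly taken

    def predict(self, pc):
        return self.table[pc % self.entries] >= 4  # top bit set -> taken

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.table[i] = min(7, self.table[i] + 1)  # saturate at 7
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0

bp = CounterPredictor()
for outcome in [True, True, False, True]:
    print(bp.predict(0x40), end=" ")
    bp.update(0x40, outcome)
# -> True True True True: one not-taken outcome does not flip the bias
```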

Results
Optimizations:
1. moving instructions up (scheduling them earlier within a window)
2. expanding the basic window beyond a basic block (in eqntott and espresso)
Otherwise, a basic window is at most a basic block.
But is a 100% cache hit rate reasonable?

Discussion
- How does this compare to a CMP? To RAW?
- Does the trade-off strike a balance?

New Results (1): in-order execution

New Results (2): out-of-order execution