Presented by Varun Mathur and Mingwei Liu. Based on work by Sanghyun Park, Aviral Shrivastava, and Yunheung Paek.

 The disparity between memory and processor speeds keeps increasing.
 30-40% of instructions access memory.
 Processor time is wasted waiting for memory.

 Multiple issue
 Value prediction
 Out-of-order execution
 Speculative execution
 Problem:
 Only suited for high-end processors
 Significant power & area overhead
 Most embedded processors are:
 Single issue
 Non-speculative
 Alternative mechanisms are required to hide memory latency with minimal power & area overheads.

 The compiler classifies instructions into high and low priority.
 The processor executes high-priority instructions immediately.
 Execution of low-priority instructions is postponed.
 Low-priority instructions are executed on a cache miss.
 Hardware required:
 A simple register renaming mechanism
 A buffer to keep the state of low-priority instructions

 High-priority instructions are those required to keep the following going:
 Instruction fetch
 Data transfers
 Therefore, high-priority instructions are:
 Branches
 Loads
 Instructions feeding high-priority instructions
 All others are low priority.

1. Mark loads & branches.
2. Mark instructions that generate source operands of marked instructions.
3. Mark conditional instructions that marked instructions are control dependent on.
4. Repeat steps 2 and 3 until no new instructions are marked.
Marked instructions are high priority; the rest are low priority (see the sketch below).
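A minimal sketch of this marking pass, assuming a simple instruction representation (the Instruction fields, and the use of register names in place of real def-use chains, are simplifications for illustration, not the authors' implementation):

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based hashing so instructions can live in sets
class Instruction:
    opcode: str
    dests: list = field(default_factory=list)      # registers written
    srcs: list = field(default_factory=list)       # registers read
    ctrl_deps: list = field(default_factory=list)  # branches this instruction is control dependent on

def mark_high_priority(insts):
    """Fixed-point version of the marking pass; returns the high-priority
    set. Everything not returned is low priority."""
    # Step 1: seed with loads and branches.
    marked = {i for i in insts if i.opcode in ("load", "branch")}
    changed = True
    while changed:                       # Step 4: repeat steps 2 and 3
        changed = False
        for inst in insts:
            if inst in marked:
                continue
            # Step 2: inst produces a source operand of a marked instruction
            # (a real pass would use def-use chains, not raw register names).
            feeds = any(d in m.srcs for m in marked for d in inst.dests)
            # Step 3: some marked instruction is control dependent on inst.
            controls = any(inst in m.ctrl_deps for m in marked)
            if feeds or controls:
                marked.add(inst)
                changed = True
    return marked
```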

 Considers only instructions in loops.
 First, all high-priority instructions of the loop are executed.
 Once done, the pending low-priority instructions are executed.
 The processor cannot wait for another cache miss, since cache misses are rare outside loops.
 ISA enhancements:
 1 bit of priority information per instruction
 A special flushLowPriority instruction

 Two execution modes:
 High priority
 Low priority
 In high-priority mode, only high-priority instructions are executed.
 Low-priority instructions have their operands renamed and are queued in a ROB.
 On a cache miss, the processor switches to low-priority mode and executes low-priority instructions.
 If the cache miss is resolved -> back to high-priority mode.
 If all low-priority instructions have executed and the miss is still unresolved -> back to high-priority mode (see the sketch below).
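A toy model of the mode-switch policy, assuming one instruction per cycle (the names and structure are illustrative; the real priority selector is hardware logic, not software):

```python
HIGH, LOW = "high-priority", "low-priority"

def next_mode(mode, rob, miss_pending):
    """One cycle of the priority selector."""
    if mode == HIGH:
        # A cache miss triggers the switch; otherwise keep issuing high-P insts.
        return LOW if miss_pending else HIGH
    # LOW mode: drain queued low-priority instructions while the miss is pending.
    if not miss_pending:
        return HIGH                # miss resolved: resume high-priority execution
    if not rob:
        return HIGH                # ROB empty: nothing left to hide the latency
    rob.pop(0)                     # execute one queued low-priority instruction
    return LOW

# Example: a 3-entry ROB drains during a miss, then control returns to high-P mode.
rob, mode = ["lp0", "lp1", "lp2"], HIGH
mode = next_mode(mode, rob, miss_pending=True)   # HIGH -> LOW
while mode == LOW:
    mode = next_mode(mode, rob, miss_pending=True)
print(rob)  # []
```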

 The number of effective instructions in a loop is reduced => the loop executes faster.
 Low-priority instructions are executed on a cache miss => the cache-miss latency is utilized.

 Architecturally similar to out-of-order execution, but with compiler support.
 Requires full register renaming capability.
 P: 1-bit priority indicator
 Rename table: mapping information from real register -> rename register
 Rename manager: instructions in the ROB are renamed dynamically using information from the table
 Priority selector: decides the execution mode of the processor
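An illustrative sketch of the rename-table idea for queued low-priority instructions (the class and its methods are assumptions; the real mechanism is a hardware table):

```python
class RenameTable:
    def __init__(self, num_rename_regs):
        self.free = list(range(num_rename_regs))  # free rename registers
        self.table = {}                           # real register -> rename register

    def rename(self, srcs, dests):
        """Rename one queued low-priority instruction.

        Sources are redirected to any existing renames so the instruction reads
        the value its producer wrote; each destination gets a fresh rename
        register so the postponed write never clobbers architectural state."""
        renamed_srcs = [self.table.get(r, r) for r in srcs]
        renamed_dests = []
        for r in dests:
            phys = self.free.pop(0)               # allocate a rename register
            self.table[r] = phys                  # later low-P readers use it
            renamed_dests.append(phys)
        return renamed_srcs, renamed_dests

# Example: "add r3, r1, r2" then "mul r4, r3, r3" queued in the ROB.
rt = RenameTable(num_rename_regs=8)
print(rt.rename(srcs=["r1", "r2"], dests=["r3"]))  # (['r1', 'r2'], [0])
print(rt.rename(srcs=["r3", "r3"], dests=["r4"]))  # ([0, 0], [1])
```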

 Uses the Intel XScale microarchitecture:
 Single fetch, single decode, single issue
 7-stage pipelined processor
 32K, 32-way I-cache & D-cache
 Experiments were done on the innermost loops of several multimedia applications:
 H.264 decoder from MediaBench
 2 checksum benchmarks from MiBench
 3 kernels from swim (SPEC)
 Benchmarks from DSPstone
 Why were these benchmarks chosen?
 They are important kernels of multimedia applications.
 They have a non-negligible cache miss penalty due to data-intensive calculations.

 ROB = 100 entries
 GSR shows the most improvement
 17% average reduction in execution time

The Laplace benchmark is an interesting case: even when there is not much scope for improvement, this method can still hide significant portions of the memory latency.

 Performance saturates beyond a ROB size of 100 entries.
 This is because the Intel XScale's average memory latency is 75 cycles: on a single-issue pipeline, only about that many low-priority instructions can execute during one miss, so additional ROB entries go unused.

 Performance is better than 1F1D1I (single fetch, decode, issue).
 Performance is better than 2F2D2I (dual fetch, decode, issue).
 Worse than an out-of-order 2F2D2I, but with much smaller power consumption.

 Future direction: measuring the effectiveness of this technique on high-issue processors.
 Questions?