Register Cache System not for Latency Reduction Purpose
Ryota Shioya, Kazuo Horio, Masahiro Goshima, and Shuichi Sakai (The University of Tokyo)

Outline of research
 The area of the register file (RF) in OoO superscalar processors has been increasing
  - It is comparable to that of an L1 data cache
  - A large RF causes many problems: area, complexity, latency, energy, and heat
 A register cache was proposed to solve these problems
  - But its miss penalties are likely to degrade IPC
 We solve these problems with a "not fast" register cache
  - Our method is faster than methods with "fast" register caches
  - Circuit area is reduced to 24.9% of the baseline with only 2.1% IPC degradation

Background: the increase in the circuit area of register files
 The area of the register file (RF) in OoO superscalar processors has been increasing
 1. The increase in the number of entries
  - The required number of RF entries is proportional to the number of in-flight instructions
  - Multi-threading requires a number of entries proportional to the number of threads
 2. The increase in the number of ports
  - A register file consists of a highly ported RAM
  - The area of a RAM is proportional to the square of the number of ports (see the sketch below)
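A minimal sketch of that scaling law, assuming the common first-order model in which every port adds one wordline to a cell's height and one bitline to its width, so cell area grows with the square of the port count. The paper's actual figures come from CACTI; the port counts below are the ones used in the evaluation, and the 64-bit width is illustrative (the area ratio does not depend on it):

```python
def ram_area(entries, width_bits, ports, unit_cell=1.0):
    """First-order RAM area model: each port adds a wordline (height)
    and a bitline (width) to every cell, so cell area ~ ports**2.
    unit_cell is an arbitrary single-ported cell area."""
    return entries * width_bits * unit_cell * ports ** 2

# Heavily ported baseline RF vs. the port-reduced MRF behind an RC:
prf = ram_area(entries=128, width_bits=64, ports=8 + 4)  # 8R/4W baseline
mrf = ram_area(entries=128, width_bits=64, ports=2 + 2)  # 2R/2W MRF
print(f"port-reduced MRF array: {mrf / prf:.1%} of baseline")  # 11.1%
```

Under this model, cutting the ports from 12 to 4 alone shrinks the array to (4/12)^2 = 1/9 of its size, which is why a register cache pays off even before any latency argument.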

RF area is comparable to that of an L1 data cache
 Ex.: the integer execution core of the Pentium 4
  - L1 data cache: 16KB, 1 read / 1 write port
  - Register file: 64 bits x 144 entries, 6 read / 3 write ports (double pumped)

Problems of large register files
 1. The increase in energy consumption, and the heat that results from it
  - The energy consumption of a RAM is proportional to its area
  - The region containing the RF is a hot spot in the core, which limits clock frequency
 2. The increase in access latency
  - The access latency of an RF is also comparable to that of an L1 D-cache, so the RF must be pipelined
  - Pipelining the RF increases bypass-network complexity, and a deeper pipeline decreases IPC through larger misprediction penalties and stalls due to resource shortage

A register cache was proposed to solve the problems of a large RF
 Register cache (RC): a small & fast buffer that caches values held in the main register file (MRF)
 The miss penalties of an RC are likely to degrade IPC
  - Ex.: IPC degrades by 20.9% with an 8-entry RC system

Our proposal
 Research goal: reducing the circuit area of the RF
 We achieve this with a "not fast" register cache
  - Our RC does not reduce latency
  - Yet our method is faster than methods with "fast" register caches

Our proposal
 Usually a cache is defined as "a fast buffer" (textbooks, Wikipedia)
  - By this definition, our proposal is not a cache
 One may therefore ask:
  - Why does an RC that does not reduce latency work at all? A not-fast cache is usually meaningless
  - Why does a not-fast cache beat a fast one?

Outline
 1. Conventional method (LORCS: Latency-Oriented Register Cache System)
 2. Our proposal (NORCS: Non-latency-Oriented Register Cache System)
 3. Details of NORCS
 4. Evaluation

Register cache
 A register cache (RC) is a small and fast buffer
 It caches values held in the main register file (MRF)

Register cache system
 A register cache system refers to a system that includes the following:
  - Register cache
  - Main register file
  - Pipeline stages
 [Block diagram: the instruction window feeds the ALUs; a register cache and write buffer sit in the pipeline, backed by the main register file, which is out of the pipeline]
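To make the division of labor concrete, here is a minimal behavioral sketch of these pieces. It is my own illustration (an LRU-managed RC in front of a plain-array MRF, matching the LRU policy named later in the evaluation), not the paper's hardware design; write-buffer and pipeline timing are omitted:

```python
from collections import OrderedDict

class RegisterCacheSystem:
    """Behavioral model: a small LRU register cache (RC) in front of
    a large main register file (MRF). read() returns (value, hit)."""

    def __init__(self, rc_entries, mrf_entries):
        self.rc_entries = rc_entries
        self.rc = OrderedDict()        # reg number -> value, in LRU order
        self.mrf = [0] * mrf_entries   # backing store holding all registers

    def write(self, reg, value):
        self.mrf[reg] = value          # the MRF always holds every value
        self.rc[reg] = value           # newly produced values enter the RC
        self.rc.move_to_end(reg)
        if len(self.rc) > self.rc_entries:
            self.rc.popitem(last=False)  # evict the LRU entry

    def read(self, reg):
        if reg in self.rc:             # RC hit: no MRF read port consumed
            self.rc.move_to_end(reg)
            return self.rc[reg], True
        return self.mrf[reg], False    # RC miss: falls back to the MRF

rcs = RegisterCacheSystem(rc_entries=8, mrf_entries=128)
rcs.write(5, 42)
print(rcs.read(5))   # (42, True): recently written values hit the RC
print(rcs.read(77))  # (0, False): a cold register misses to the MRF
```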

LORCS: Latency-Oriented Register Cache System
 There are two kinds of register cache systems:
  1. Latency-oriented: the conventional method
  2. Non-latency-oriented: our proposal
 We refer to the conventional RC system as LORCS to distinguish it from our proposal
  - Usually, "register cache" refers to this kind

LORCS: Latency-Oriented Register Cache System
 Pipeline that assumes a hit
  - All instructions are scheduled assuming that their operands will hit
  - If all operands hit, the system is equivalent to one with a 1-cycle-latency RF
  - On an RC miss, the pipeline is stalled (miss penalty)
  - This pipeline behaves the same way as a general data cache

Behavior of LORCS
 On an RC miss, the processor backend is stalled while the MRF is accessed
 [Pipeline diagram: I1-I5 flow through IS, CR, EX, CW; when an operand misses in CR, an RR stage reads the MRF and all younger instructions stall. Legend: CR = RC read, CW = RC write, RR = MRF read]

Consideration of RC miss penalties
 The processor backend is stalled on RC misses
 OoO superscalar processors can selectively delay instructions with long latencies
  - One may think the same could be done for RC misses
 Selectively delaying instructions on RC misses is difficult
  - because of the non-re-schedulability of a pipeline

Non-re-schedulability of a pipeline
 Instructions cannot be re-scheduled once they are inside the pipeline
 [Block diagram: I0-I5 are already in flight between the instruction window and the ALUs; when I0 misses the RC, the instructions behind it cannot be re-ordered around it]

Purposes of LORCS
 1. Reducing latency; in the ideal case this solves all the problems of pipelined RFs:
  - Reducing misprediction penalties
  - Releasing resources earlier
  - Simplifying the bypass network
 2. Reducing the number of ports of the MRF

Reducing ports of the MRF
 Only operands that miss the RC access the MRF, so the MRF requires only a few ports
 This reduces the area of the MRF
  - RAM area is proportional to the square of the number of ports (see the area sketch earlier)
 [Diagram: a heavily ported register file feeding the ALUs directly vs. a small RC backed by a port-reduced main register file]

Outline
 1. Conventional method (LORCS: Latency-Oriented Register Cache System)
 2. Our proposal (NORCS: Non-latency-Oriented Register Cache System)
 3. Details of NORCS
 4. Evaluation

NORCS: Non-latency-Oriented Register Cache System
 NORCS also consists of an RC, an MRF, and pipeline stages
 It has a structure similar to LORCS
  - It reduces the area of the MRF in the same way as LORCS
  - Yet NORCS is much faster than LORCS
 The important difference: NORCS has pipeline stages for accessing the MRF

NORCS has stages accessing the MRF
 [Pipeline diagrams of LORCS vs. NORCS: NORCS adds MRF-read stages to the pipeline itself, while in LORCS the MRF sits out of the pipeline]
 The difference is quite small, but it makes a great difference to performance

Pipeline of NORCS
 NORCS has stages to access the MRF
 Pipeline that assumes a miss:
  - All instructions wait as if they had missed, regardless of whether they hit or miss
 It is a cache, but it does not make the processor fast

Pipeline that assumes a miss
 The pipeline of NORCS is not stalled merely by RC misses
 The pipeline stalls only when the read ports of the MRF run short
 [Pipeline diagram: every instruction carries RR stages between CR and EX; on an RC hit the RR stages do not access the MRF, and on a miss they do. Legend: CR = RC read, CW = RC write, RR = MRF read]

Pipeline disturbance
 Conditions for a pipeline stall:
  - LORCS: one or more RC misses occur in a single cycle
  - NORCS: more RC misses occur in a single cycle than the MRF has read ports (the MRF is accessed in the meantime)
 This probability is very small
  - Ex.: 1.0% in NORCS with a 16-entry RC (see the stall-probability sketch after the hmmer example below)

Outline
 1. Conventional method (LORCS: Latency-Oriented Register Cache System)
 2. Our proposal (NORCS: Non-latency-Oriented Register Cache System)
 3. Details of NORCS
  - Why NORCS can work
  - Why NORCS is faster than LORCS
 4. Evaluation

Why NORCS can work
 A pipeline that assumes a miss works only for an RC
  - A not-fast cache is meaningless for a general data cache
 We explain this with data-flow graphs (DFGs) and dependencies
 There are two types of dependencies:
  - "value-value" dependency
  - "via-index" dependency

Value-value dependency
 Data cache:
  - A dependency from a store to a dependent load instruction
  - An increase in cache latency can increase the height of the DFG
 [DFG sketch: store [a] <- r1 followed by load r1 <- [a]; the L1 access latency is added to the height of the DFG]

Value-value dependency
 Register cache:
  - RC latency is independent of the DFG height
  - The bypass network can forward the data directly
 [DFG sketch: mov r1 <- A followed by add r2 <- r1 + B; the EX-to-EX bypass passes the value, so the CR/CW stages stay off the critical path, unlike the data-cache case]

Via-index dependency
 Data cache:
  - A dependency from an instruction's result to a dependent load's address
  - Ex.: pointer chasing in a linked list (see the sketch below); the L1 access latency is added to the height of the DFG once per hop
 Register cache:
  - There are no dependencies from execution results to register numbers (register operands are fixed in the instruction encoding)
  - So the RC latency is independent of the height of the DFG
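A tiny illustration of the via-index case (my own example, written in ordinary Python rather than assembly): walking a linked list serializes the loads, because each load's address is the previous load's value, so every cycle of data-cache latency lands on the critical path. A register number, by contrast, is never computed at run time, which is why an RC can afford to be slow.

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def chase(head):
    """Pointer chasing: the address for each `node.next` load is the
    value produced by the previous load (a via-index dependency), so
    the loads cannot overlap and load latency accumulates per hop."""
    total = 0
    node = head
    while node is not None:
        total += node.value
        node = node.next  # next load address = result of this load
    return total

print(chase(Node(1, Node(2, Node(3)))))  # 6
```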

Why NORCS is faster than LORCS
 NORCS always waits; LORCS waits only when needed; yet NORCS is faster, because:
 1. LORCS suffers RC miss penalties with high probability
  - Stalls occur more frequently than the per-operand RC miss rate suggests
 2. NORCS avoids these penalties
  - It shifts RC miss penalties onto branch misprediction penalties, which occur with low probability

Stalls occur more frequently than the per-operand RC miss rate
 In LORCS, the pipeline stalls when one or more RC misses occur in a single cycle
 Ex.: SPEC CPU2006 INT hmmer with a 32-entry RC
  - RC miss rate: 5.8% (hit rate: 94.2%)
  - Source operands per cycle: 2.49
  - Probability that all operands hit: 0.942^2.49 ≈ 0.86
  - Probability of a stall (any operand misses) ≈ 0.14 (14%)
 Our evaluation shows more than 10% IPC degradation on this benchmark
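A sketch of that arithmetic, extended with the corresponding NORCS condition. The miss rate, operand rate, and the 2 MRF read ports come from the slides; the independence of per-operand misses and the 4-operand issue width used for the NORCS case are my own assumptions for illustration:

```python
from math import comb

hit_rate = 0.942       # per-operand RC hit rate (hmmer, 32-entry RC)
ops_per_cycle = 2.49   # average source operands read per cycle

# LORCS stalls if ANY operand read in a cycle misses:
p_all_hit = hit_rate ** ops_per_cycle
print(f"LORCS stall probability ~ {1 - p_all_hit:.1%}")  # ~13.8%

# NORCS stalls only when misses exceed the MRF read ports.
# Binomial tail over n operand reads per cycle, independent misses:
def norcs_stall(n, read_ports, miss):
    return sum(comb(n, k) * miss**k * (1 - miss)**(n - k)
               for k in range(read_ports + 1, n + 1))

p = norcs_stall(n=4, read_ports=2, miss=1 - hit_rate)
print(f"NORCS stall probability ~ {p:.2%}")  # well under 1%
```

Under these assumptions the assume-miss pipeline turns a 14% stall probability into a fraction of a percent, which matches the direction of the slide's 1.0% figure for a 16-entry RC.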

[Pipeline diagrams of LORCS vs. NORCS around a branch misprediction: in LORCS, the MRF latency is paid as a separate stall on every RC miss; in NORCS, the MRF-read stages (RR1/RR2) are part of the front of the pipeline, so the MRF latency is paid only as an addition to the branch misprediction penalty (bpred penalty + MRF latency). Annotated probabilities: about 5% for a branch misprediction vs. about 14% for an LORCS RC-miss stall. Legend: IF = fetch, IS = issue, CR = RC read, RR = MRF read, EX = execute]
NORCS can shift RC miss penalties onto branch misprediction penalties.

Outline
 1. Conventional method (LORCS: Latency-Oriented Register Cache System)
 2. Our proposal (NORCS: Non-latency-Oriented Register Cache System)
 3. Details of NORCS
 4. Evaluation

Evaluation environment
 1. IPC: measured with the "Onikiri2" simulator
  - A cycle-accurate processor simulator developed in our laboratory
 2. Area & energy consumption: measured with CACTI 5.3
  - ITRS 45nm technology node
  - We evaluate the RAM arrays of the RC & MRF
 Benchmarks: all 29 programs of SPEC CPU 2006, compiled using gcc with -O3

3 evaluation models
 1. PRF: pipelined RF (baseline)
  - 8-read/4-write ports, 128-entry RF
 2. LORCS
  - RC: LRU, and use-based replacement with a use predictor [J. A. Butts and G. S. Sohi, "Use-Based Register Caching with Decoupled Indexing," in Proceedings of ISCA, 2004]
  - 2-read/2-write ports, 128-entry MRF
 3. NORCS
  - RC: LRU only

Simulation configurations
 Configurations are set to match modern processors

 Name                Value
 Fetch width         4 inst.
 Instruction window  int: 32, fp: 16, mem: 16
 Execution units     int: 2, fp: 2, mem: 2
 MRF                 int: 128, fp: 128
 BTB                 2K entries, 4-way
 Branch predictor    g-share, 8KB
 L1C                 32KB, 4-way, 2 cycles
 L2C                 4MB, 8-way, 10 cycles
 Main memory         200 cycles

Average probabilities of stalls and branch mispredictions
 In NORCS, the probability of a stall is very low
 In LORCS, the probability of a stall is much higher than the probability of a branch misprediction in the models with 4-, 8-, and 16-entry RCs

Average relative IPC
 All IPCs are normalized to that of PRF
 In LORCS, IPC degrades significantly in the models with 4-, 8-, and 16-entry RCs
 In NORCS, the IPC degradation is very small
 LORCS with a 64-entry LRU RC and NORCS with an 8-entry RC reach almost the same IPC ... but see the area trade-off next

Trade-off between IPC and area
 RC entries: 4, 8, 16, 32, 64
 NORCS with an 8-entry RC:
  - Area is reduced by 75.1% with only 2.1% IPC degradation relative to PRF

Trade-off between IPC and area
 RC entries: 4, 8, 16, 32, 64
 NORCS with an 8-entry RC:
  - Area is reduced by 72.4% compared with LORCS-LRU at the same IPC
  - IPC is improved by 18.7% compared with LORCS-LRU at the same area

Trade-off between IPC and energy consumption
 RC entries: 4, 8, 16, 32, 64
 Our proposal (NORCS with an 8-entry RC):
  - Energy consumption is reduced by 68.1% with only 2.1% IPC degradation
  - The chart also annotates an 18.7% IPC improvement over LORCS-LRU

Conclusion
 A large RF causes many problems
  - Energy and heat are especially serious
 A register cache was proposed to solve these problems
  - But its miss penalties degrade IPC
 NORCS: Non-latency-Oriented Register Cache System
  - We solve the problems with a "not fast" register cache
  - Our method outperforms methods with "fast" register caches
  - Circuit area is reduced by 75.1% with only 2.1% IPC degradation

Appendix


Onikiri2
 Implementation:
  - An abstract pipeline model of an OoO superscalar processor
  - Instruction emulation is performed in the exec stage, so incorrect scheduling produces incorrect results
 Features:
  - Written in C++
  - Supported ISAs: Alpha and PowerPC
  - Can execute all the programs in SPEC 2000/2006

[Block diagram of the register cache system: the ALU sends read register numbers to the register cache; an arbiter reports hit/miss and directs misses to the main register file, which is out of the instruction pipeline; write register numbers/data go to both structures]

[Detailed block diagram: the register cache's tag array and data array, with an arbiter, a write buffer, and added latches in front of the main register file; read register numbers probe the tag array for hit/miss, and writes drain to the MRF through the write buffer]

[Chart: relative area (%)]

[Chart: relative energy consumption]

[Chart: probabilities of RC miss / miss penalty / branch misprediction]