Register Write Specialization, Register Read Specialization: A path to complexity-effective wide-issue superscalar processors

Presentation transcript:

1 Register Write Specialization, Register Read Specialization: A path to complexity-effective wide-issue superscalar processors
André Seznec, Eric Toullec, Olivier Rochecouste
IRISA/INRIA

AS-ET-OR Caps Team Irisa 2 Doubling the issue width
• Functional units:
  • Silicon area: 2x
  • Power consumption: 2x
  • Same latency
• Register file:
  • Silicon area: > 7x
  • Power consumption: > 4x
  • Access time: 1.5x
• Wake-up logic entries:
  • Each entry monitors twice as many inputs
  • Impact on area, power consumption and response time
• Bypass network:
  • Wider multiplexers: > 2x
  • Longer communications

3 An unwritten rule applied in all superscalar processor designs
• For general-purpose registers: any physical register can be the source or the result of any instruction executed on any functional unit

4 Distributed register file
[Figure: clusters C0, C1, C2, C3]
Local register file: shorter read access time but larger silicon area

5 The register file issue

6 Silicon area for the physical register file

7 8-way against 4-way (100 nm, 5 GHz)
• 4-way distributed register file (2 identical copies): 128 x 320 w² x W, 3.1 W, 3 cycles
• 8-way distributed register file (4 identical copies): 256 x 1792 w² x W (x 11.2), 14.5 W (x 4.5), 4 cycles (+1)
• 8-way monolithic register file: 256 x 1120 w² x W (x 7), 16 W (x 5), 5 cycles (+2)
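
The per-bit area figures on this slide are consistent with a common register-cell model in which a cell with R read ports and W write ports occupies (R + W) x (R + 2W) squared wire pitches (w²). Neither that formula nor the port counts below appear on the slide. Both are assumptions (16 reads and 12 writes per cycle for 8-way issue, 8 reads and 6 writes for 4-way, each copy serving the reads of one 2-way cluster), chosen because they reproduce the slide's numbers. A minimal sketch:

```python
def cell_area(read_ports: int, write_ports: int, copies: int = 1) -> int:
    """Area per register bit in w^2, assuming (R + W) * (R + 2W) per copy."""
    return copies * (read_ports + write_ports) * (read_ports + 2 * write_ports)

# Assumed port counts: every copy serves the 4 reads of one 2-way cluster
# but must accept every result written anywhere in the machine.
four_way_dist = cell_area(4, 6, copies=2)
eight_way_dist = cell_area(4, 12, copies=4)
eight_way_mono = cell_area(16, 12)

# Total area relative to the 4-way design (128 registers against 256).
base = 128 * four_way_dist
ratio_dist = 256 * eight_way_dist / base
ratio_mono = 256 * eight_way_mono / base

print(four_way_dist, eight_way_dist, eight_way_mono)  # 320 1792 1120
print(ratio_dist, ratio_mono)                         # 11.2 7.0
```

Under these assumptions, the write ports dominate the cost of the distributed file: replicating the file cuts read ports per copy, but every copy still carries all 12 write ports.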

8 Let us reduce the number of ports on each individual register

9 Register Write Specialization
[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3]

10 Distributed Register File and Register Write Specialization
[Figure: clusters C0, C1, C2, C3]

11 Register Write Specialization
• Each cluster writes only a subset of the registers
• Fewer write ports on every individual physical register
• 4-cluster 8-way distributed register file, 512 entries:
  • 280 x w² per register bit: 1/2 or 1/3 of conventional
  • 3-cycle access time: saves 1 or 2 cycles
  • 8.5 W against 14.5 W or 16 W
But allocation of instructions to clusters must precede register renaming

12 Register Write Specialization and Register Renaming
  1: Op R6, R7 -> R5
  2: Op R2, R5 -> R6
  3: Op R6, R3 -> R4
  4: Op R4, R6 -> R2
4 free odd registers, 4 free even registers, 4-bit subset target vector
  1: Op L6, L7 -> res1
  2: Op L2, res1 -> res2
  3: Op res2, L3 -> res3
  4: Op res3, res2 -> res4
4 new free registers + old map table
  1: Op P6, P7 -> RES1
  2: Op P2, RES1 -> RES2
  3: Op RES2, P3 -> RES3
  4: Op RES3, RES2 -> RES4
New map table

13 Register Write Specialization and Register Renaming (2)
• Consumes a lot of registers: need for recycling
  1. Build two lists of registers to be recycled
  2. Pack both lists
  3. Concatenate the two lists
  4. Append to the free list
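
The over-allocate-then-recycle scheme of slides 12 and 13 can be sketched as follows. This is an illustration under stated assumptions, not the authors' hardware: two subsets (even and odd registers), groups of four instructions, and a hypothetical `rename_group` helper. The target vector values are made up, source-operand renaming is left out, and the final "append" step returns each register to its own subset's free list, which the slide does not spell out.

```python
from collections import deque

GROUP = 4  # instructions renamed per cycle, as on the slide

def rename_group(dests, target_vector, free, map_table):
    """target_vector[i] is the subset (0 = even, 1 = odd) written by instruction i."""
    # Over-allocate: take GROUP registers from each subset's free list.
    picked = [[free[s].popleft() for _ in range(GROUP)] for s in (0, 1)]
    # Instruction i takes slot i of the list selected by its target bit.
    for i, (dest, subset) in enumerate(zip(dests, target_vector)):
        map_table[dest] = picked[subset][i]
        picked[subset][i] = None  # slot consumed
    # Recycling: 1) two lists with holes, 2) pack them, 3) concatenate ...
    packed = [[r for r in lst if r is not None] for lst in picked]
    recycled = packed[0] + packed[1]
    # ... 4) append (assumption: back to the free list matching the parity).
    for r in recycled:
        free[r % 2].append(r)
    return recycled

free = {0: deque(range(32, 48, 2)), 1: deque(range(33, 49, 2))}
map_table = {}
recycled = rename_group(["R5", "R6", "R4", "R2"], [0, 1, 1, 0], free, map_table)
print(map_table)
print(recycled)
```

Eight registers are drawn for four results, so four always come back, which is why the slide flags register consumption as the cost of this approach.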

14 Register Write Specialization and Register Renaming (3)
• An alternative:
  • Compute the number of registers needed in each register subset
  • Pick the right number of registers from each of the free lists
  • No need for recycling registers
Think about round-robin distribution!
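
A minimal sketch of this alternative, assuming four register subsets with one free list each and hypothetical helper names (`round_robin_targets`, `rename_exact`). Physical register numbers are chosen so that `r % 4` gives the subset, purely for illustration. With round-robin distribution the per-subset counts are known in advance, so the counting step becomes trivial.

```python
from collections import Counter, deque

def round_robin_targets(first, group_size, n_subsets=4):
    """Target subset of each instruction under round-robin distribution."""
    return [(first + i) % n_subsets for i in range(group_size)]

def rename_exact(dests, targets, free, map_table):
    """Pick exactly the registers each subset needs; nothing is left to recycle."""
    need = Counter(targets)  # registers needed per subset
    if any(len(free[s]) < n for s, n in need.items()):
        return False  # not enough free registers in some subset: stall
    picked = {s: [free[s].popleft() for _ in range(n)] for s, n in need.items()}
    for dest, s in zip(dests, targets):
        map_table[dest] = picked[s].pop(0)  # hand out in program order
    return True

free = {s: deque(range(64 + s, 96, 4)) for s in range(4)}
map_table = {}
targets = round_robin_targets(0, 4)
ok = rename_exact(["R5", "R6", "R4", "R2"], targets, free, map_table)
print(targets, map_table)
```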

15 Performance issues
• Register Write Specialization only:
  • Round-robin allocation: no extra stage for register renaming
  • Shorter register access time
  • Overall shorter pipeline: slightly better performance

16 Register Read Specialization
[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1]

17 Register Read Specialization
• Limits the number of read ports on each individual register
• Puts strong constraints on the allocation of instructions to clusters
• Caution:
  • Personal opinion: don't use it alone!
The interconnection topology must ensure that every instruction is executable

18 WSRS architectures
Combining Register Read Specialization and Register Write Specialization

19 4-cluster WSRS architecture
[Figure: register subsets S0, S1, S2, S3 and clusters C0, C1, C2, C3]
The positions of an instruction's operands determine the execution cluster

20 4-cluster WSRS architecture: allocating instructions to clusters
[Figure: register subsets S0, S1, S2, S3 and clusters C0, C1, C2, C3]
Example: Op R6, R7 -> R5, with operands in S1 and S2 and result in S0
• The first operand determines the top or bottom bicluster
• The second operand determines the left or right bicluster

21 4-cluster WSRS architecture: allocating instructions to clusters (2)
Example: Op R6, R7 -> R5, with operands in S1 and S2 and result in S0
The computations of the two bits are independent :-)
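
The two-bit cluster selection can be sketched as below. The concrete mapping is an assumption, since the slides only show it in a figure: clusters are taken to sit in a 2 x 2 grid (C0 top-left, C1 top-right, C2 bottom-left, C3 bottom-right), S0 and S1 feed the top row as first operands, S0 and S2 feed the left column as second operands, and cluster Ci writes subset Si. This choice agrees with the slide's example (operands in S1 and S2, result in S0), but other topologies would too.

```python
def execution_cluster(first_subset: int, second_subset: int) -> int:
    """Return the cluster index for a dyadic instruction (assumed topology)."""
    row = first_subset >> 1  # S0, S1 -> top (0); S2, S3 -> bottom (1)
    col = second_subset & 1  # S0, S2 -> left (0); S1, S3 -> right (1)
    return 2 * row + col     # each bit depends on one operand only

# Slide example: Op R6, R7 -> R5 with R6 in S1 and R7 in S2.
cluster = execution_cluster(1, 2)
print(cluster)  # 0, so the result is written into S0

# Every cluster is reachable from some pair of operand positions.
reachable = {execution_cluster(a, b) for a in range(4) for b in range(4)}
print(sorted(reachable))  # [0, 1, 2, 3]
```

Because the row bit comes only from the first operand and the column bit only from the second, the two lookups can proceed in parallel.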

22 4-cluster 8-way WSRS architecture: the register file
• Each individual physical register: 2 identical copies of (4-read, 3-write) registers
  • 8x smaller than the conventional monolithic approach
  • 12.8x smaller than the conventional distributed approach
• WSRS: 512 registers, 6.25 W, 3 cycles
• Conventional: 256 registers, (16 W, 5 cycles) or (14.5 W, 4 cycles)
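
The 8x and 12.8x ratios can be checked with a back-of-the-envelope calculation. It assumes a cell area of (R + W) x (R + 2W) w² per copy, a model that is not stated on the slides, together with assumed conventional port counts (16 reads and 12 writes for the monolithic file, 4 copies of 4 reads and 12 writes for the distributed one). Only the (4-read, 3-write) x 2 figure is taken from the slide.

```python
def cell_area(read_ports: int, write_ports: int, copies: int = 1) -> int:
    """Area per register bit in w^2, assuming (R + W) * (R + 2W) per copy."""
    return copies * (read_ports + write_ports) * (read_ports + 2 * write_ports)

wsrs = cell_area(4, 3, copies=2)          # from the slide: 2 copies, 4R / 3W
monolithic = cell_area(16, 12)            # assumed port counts
distributed = cell_area(4, 12, copies=4)  # assumed port counts

print(wsrs)                # 140
print(monolithic / wsrs)   # 8.0
print(distributed / wsrs)  # 12.8
```

These are per-register-bit ratios. The WSRS file holds 512 registers against 256, so the total-area advantage is half of each ratio under the same assumptions.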

23 4-cluster 8-way WSRS architecture: the wake-up logic
• The wake-up logic monitors all possible sources for each operand
• FUs from only two clusters are possible sources
8-way WSRS wake-up logic entry complexity = 4-way issue wake-up logic entry complexity

24 4-cluster 8-way WSRS architecture: bypass network
• Possible sources for each operand: FUs from only two clusters
• Bypass points: (pipeline length) x (possible FU sources) + register file
  • 8-way distributed: 4 cycles, 49 possible operand sources
  • WSRS (same as 4-way): 3 cycles, 19 possible operand sources
  • 8-way monolithic: 5 cycles, 61 possible operand sources
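
The three counts follow from the slide's formula if each 2-way cluster is assumed to have three result-producing functional units (matching the 3 write ports quoted for WSRS registers) and "pipeline length" is read as the register access time in cycles. Both readings are assumptions; the function name is illustrative.

```python
def bypass_sources(access_cycles: int, fu_sources: int) -> int:
    """(pipeline length) x (possible FU sources) + 1 for the register file."""
    return access_cycles * fu_sources + 1

FUS_PER_CLUSTER = 3  # assumption: three result-producing FUs per 2-way cluster

dist_8way = bypass_sources(4, 4 * FUS_PER_CLUSTER)  # all four clusters
wsrs_8way = bypass_sources(3, 2 * FUS_PER_CLUSTER)  # only two clusters
mono_8way = bypass_sources(5, 4 * FUS_PER_CLUSTER)  # all four clusters

print(dist_8way, wsrs_8way, mono_8way)  # 49 19 61
```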

25 4-cluster WSRS architecture: nothing is entirely free!
• Strong constraint on the allocation of instructions to clusters:
  • The cluster executing a dyadic instruction depends on the position of its operands in the register subsets
• Degrees of freedom:
  • Monadic instructions can be executed on two clusters
  • Three out of four commutative dyadic instructions can be executed on two distinct clusters
  • Design clusters able to execute instructions in two forms? A - B and -B + A
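
The "two clusters" and "three out of four" figures can be reproduced by enumeration. The sketch assumes the first operand's subset selects a row of a 2 x 2 cluster grid and the second operand's subset selects a column, that operands are uniformly spread over the four subsets, and that a monadic instruction uses only the first-operand port. None of this is spelled out on the slide.

```python
def execution_cluster(first_subset: int, second_subset: int) -> int:
    """Assumed topology: first operand picks the row, second the column."""
    return 2 * (first_subset >> 1) + (second_subset & 1)

pairs = [(a, b) for a in range(4) for b in range(4)]

# A commutative instruction may swap its operands; count the subset pairs
# for which the swapped form lands on a different cluster.
flexible = sum(execution_cluster(a, b) != execution_cluster(b, a) for a, b in pairs)
print(flexible, len(pairs))  # 12 16, i.e. three out of four

# A monadic instruction fixes the row only, so either column will do.
monadic_choices = {s: sorted({2 * (s >> 1) + col for col in (0, 1)}) for s in range(4)}
print(monadic_choices)
```

In this model the only pairs with no freedom are those whose two operands sit in the same subset.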

26 Using monadic instructions for load balancing
[Figure: register subsets S0, S1, S2, S3 and clusters C0, C1, C2, C3; label "S0 or S1"]

27 4-cluster WSRS architecture: nothing comes for free (2)
• Extra free lists and associated logic
• Extra pipeline stage(s):
  • Instructions must be allocated to clusters before the last step in register renaming: + 3 cycles
  • But shorter register access time: - 1 or 2 cycles

28 Performance issues on 4-cluster 8-way architectures
• Workload may be unbalanced among the clusters:
  • Use of the degrees of freedom: monadic instructions, "commutative" clusters
• Higher probability of local consumption of a register
Naive allocation policies on WSRS compete with naive policies on a conventional architecture

29 Orthogonal to most previous works
Just apply previous proposals at cluster level

30 Summary
• Register Write Specialization:
  • Limits power consumption, silicon area and access time
  • Does not impair performance
• But: some extra complexity in register renaming

31 Summary (2)
• Register Write Specialization + Register Read Specialization:
  • Further limits power consumption, silicon area and access time on the register file
  • Limits wake-up logic and bypass network complexity
• But: constrains instruction allocation to clusters

32 Future work
• Intelligent instruction allocation policies
• Exploration of other possible interconnections
• SMT mode