1 Register Write Specialization Register Read Specialization A path to complexity effective wide-issue superscalar processors André Seznec, Eric Toullec,

Presentation transcript:

1 Register Write Specialization, Register Read Specialization: A path to complexity-effective wide-issue superscalar processors
André Seznec, Eric Toullec, Olivier Rochecouste
IRISA/INRIA

AS-ET-OR Caps Team Irisa 2 Why design wide-issue superscalar processors? SMT superscalar processors!

3 Doubling the issue width
- Functional units:
  - Silicon area: 2x
  - Power consumption: 2x
  - Same latency
- Register file:
  - Silicon area: > 8x
  - Power consumption: > 4x
  - Access time: 1.5x
- Wake-up logic entries:
  - Monitor twice as many inputs
  - Area, consumption, response time
- Bypass network:
  - Wider multiplexors: > 2x
  - Longer communications

4 An unwritten rule applied to all superscalar processor designs
- For general-purpose registers: any physical register can be the source or the result of any instruction executed on any functional unit

5 The register file issue

6 Silicon area for the physical register file

7 Conventional clustered design
[Figure labels: clusters C0, C1, C2, C3; Register File]

8 Distributed register file
[Figure labels: clusters C0, C1, C2, C3]
Local register file: shorter read access time but larger silicon area

9 8-way against 4-way (100 nm, 5 GHz)
- 4-way distributed register file (2 identical copies): 128 x 320 w² x W, 3 cycles, 3.1 W
- 8-way monolithic register file: 256 x 1120 w² x W (x 8), 5 cycles (+2), 16 W (x 5)
- 8-way distributed register file (4 identical copies): 256 x 1792 w² x W (x 11), 4 cycles (+1), 14.5 W (x 4.5)
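The multipliers on slide 9 can be checked against the raw figures with a few lines of arithmetic. This is only a sketch: it assumes the flattened slide reads as entries x area per bit, access time in cycles, and power in watts for each configuration, with the 4-way distributed file as the baseline. The dictionary keys and the function name are illustrative. Under this reading the monolithic area ratio comes out at 7, a little below the x 8 quoted on the slide, which may reflect rounding or a detail lost in the transcript.

```python
# Figures as read from slide 9 (this reading of the flattened layout is an assumption).
configs = {
    "4-way distributed": {"entries": 128, "area_per_bit_w2": 320, "cycles": 3, "power_w": 3.1},
    "8-way monolithic": {"entries": 256, "area_per_bit_w2": 1120, "cycles": 5, "power_w": 16.0},
    "8-way distributed": {"entries": 256, "area_per_bit_w2": 1792, "cycles": 4, "power_w": 14.5},
}


def relative_to_baseline(name, baseline="4-way distributed"):
    """Return (area ratio, power ratio, extra access cycles) versus the baseline."""
    cfg, base = configs[name], configs[baseline]
    area = (cfg["entries"] * cfg["area_per_bit_w2"]) / (base["entries"] * base["area_per_bit_w2"])
    power = cfg["power_w"] / base["power_w"]
    extra_cycles = cfg["cycles"] - base["cycles"]
    return area, power, extra_cycles


for name in configs:
    area, power, extra = relative_to_baseline(name)
    print(f"{name}: area x{area:.1f}, power x{power:.1f}, access time +{extra} cycles")
```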

10 Let us reduce the number of ports on each individual register

11 Register Write Specialization
[Figure labels: clusters C0, C1, C2, C3; register subsets S0, S1, S2, S3]

12 Distributed Register File and Register Write Specialization
[Figure labels: clusters C0, C1, C2, C3]

13 Register Write Specialization
- Each cluster writes only a subset of the registers
- Fewer write ports on every individual physical register
- But allocation to clusters must precede register renaming
- 4-cluster 8-way distributed register file, 512 entries:
  - 320 x w² per register bit
  - 3 cycles access time
  - 8.5 W

14 Register Write Specialization and Register Renaming
1: Op R6, R7 -> R5
2: Op R2, R5 -> R6
3: Op R6, R3 -> R4
4: Op R4, R6 -> R2
4 free odd registers, 4 free even registers, 4-bit subset target vector
1: Op L6, L7 -> res1
2: Op L2, res1 -> res2
3: Op res2, L3 -> res3
4: Op res3, res2 -> res4
4 new free registers + old map table
1: Op P6, P7 -> RES1
2: Op P2, RES1 -> RES2
3: Op RES2, P3 -> RES3
4: Op RES3, RES2 -> RES4
New map table
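The renaming example on slide 14 can be sketched in software as follows. This is a minimal illustration, not the hardware described in the talk: it renames the group sequentially, pops a register from the free list of the chosen subset on demand, and assumes two subsets (even and odd physical registers). The function name, the initial map table, and the particular subset target vector are all hypothetical.

```python
def rename_group(instrs, target_subsets, free_lists, map_table):
    """Rename one group of instructions under register write specialization.

    instrs         -- list of (src1, src2, dst) logical register numbers
    target_subsets -- subset chosen for each result (the "subset target vector")
    free_lists     -- dict: subset index -> list of free physical registers
    map_table      -- dict: logical register -> physical register (updated in place)
    """
    renamed = []
    for (src1, src2, dst), subset in zip(instrs, target_subsets):
        # Sources are read before the destination is remapped, so an
        # instruction sees the results of earlier instructions in the group.
        p1, p2 = map_table[src1], map_table[src2]
        pdst = free_lists[subset].pop(0)  # must come from the target subset
        map_table[dst] = pdst
        renamed.append((p1, p2, pdst))
    return renamed


# Hypothetical state: logical Ri starts mapped to physical i; two subsets
# (0 = even physical registers, 1 = odd), four free registers in each.
map_table = {r: r for r in range(8)}
free_lists = {0: [8, 10, 12, 14], 1: [9, 11, 13, 15]}
group = [(6, 7, 5), (2, 5, 6), (6, 3, 4), (4, 6, 2)]  # the four Ops on the slide
targets = [0, 1, 0, 1]  # hypothetical subset target vector

renamed = rename_group(group, targets, free_lists, map_table)
for n, (p1, p2, pdst) in enumerate(renamed, start=1):
    print(f"{n}: Op P{p1}, P{p2} -> P{pdst}")
```

The dependency pattern matches the slide: instruction 2 reads the result of instruction 1, instruction 3 reads the result of 2, and instruction 4 reads the results of 3 and 2.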

15 Register Write Specialization and Register Renaming (2)
- Consumes a lot of registers: need for recycling
  1: build two lists of registers to be recycled
  2: pack both lists
  3: concatenate the two lists
  4: append to the free list

16 Register Write Specialization and Register Renaming (3)
- An alternative:
  - Compute the number of registers in each register subset
  - Pick the right number of registers from each of the free lists
  - No need for recycling registers
Think about round-robin distribution!
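The alternative on slide 16 amounts to counting how many results target each subset and taking exactly that many registers from each free list, so nothing is left over to recycle. A rough sketch, with hypothetical helper names and a two-subset free-list layout chosen only for illustration:

```python
from collections import Counter


def round_robin_targets(group_size, num_subsets, start=0):
    """Assign result subsets in round-robin order (hypothetical helper)."""
    return [(start + i) % num_subsets for i in range(group_size)]


def pick_registers(target_subsets, free_lists):
    """Take exactly as many registers from each free list as the group needs."""
    needed = Counter(target_subsets)
    for subset, count in needed.items():
        if len(free_lists[subset]) < count:
            raise RuntimeError(f"subset {subset} has too few free registers")
    # Pull the exact count from each list, then hand them out in program order.
    picked = {s: [free_lists[s].pop(0) for _ in range(n)] for s, n in needed.items()}
    return [picked[s].pop(0) for s in target_subsets]


free_lists = {0: [8, 10, 12, 14], 1: [9, 11, 13, 15]}
targets = round_robin_targets(group_size=4, num_subsets=2)
print(targets, pick_registers(targets, free_lists))
```

With round-robin distribution the count per subset is known in advance whenever the group size is a multiple of the number of subsets, which is presumably why the slide singles it out.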

17 Performance issues
- Register Write Specialization only:
  - Round-robin allocation: no extra stage for register renaming
  - Shorter register access time
  - Overall shorter pipeline: slightly better performance

18 Register Read Specialization
[Figure labels: clusters C0, C1, C2, C3; register subsets S0, S1]

19 Register Read Specialization
- Limits the number of read ports on each individual register
- Puts strong constraints on the allocation of instructions to clusters
- Caution:
  - Personal opinion: don't use it alone!
  - Interconnection topology must ensure that every instruction is executable

20 WSRS architectures
Combining Register Read Specialization and Register Write Specialization

21 4-cluster WSRS architecture
[Figure: register subsets S0 to S3 and clusters C0 to C3]
The positions of an instruction's operands determine the execution cluster

22 4-cluster WSRS architecture: allocating instructions to clusters
[Figure: register subsets S0 to S3 and clusters C0 to C3]
Op: R6, R7 -> R5 (S1, S2 -> S0)
First operand determines top or bottom bicluster
Second operand determines left or right bicluster

23 4-cluster WSRS architecture: allocating instructions to clusters (2)
Op: R6, R7 -> R5 (S1, S2 -> S0)
The computations of the two bits are independent :-)
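The allocation rule can be written as two independent one-bit decisions. The sketch below assumes a particular layout that the transcript does not spell out: C0 top-left, C1 top-right, C2 bottom-left, C3 bottom-right, with subsets S0 and S1 feeding the first operand of the top pair and subsets S0 and S2 feeding the second operand of the left pair. That layout is chosen only because it reproduces the slide's example (operands in S1 and S2, result in S0, assuming cluster Ci writes subset Si). The function name is hypothetical.

```python
def cluster_for(first_subset, second_subset):
    """Return the cluster index (0..3) for a dyadic instruction.

    Assumed layout: C0 top-left, C1 top-right, C2 bottom-left, C3 bottom-right.
    """
    # Bit 1: the subset holding the first operand selects top (0) or bottom (1).
    row_bit = 0 if first_subset in (0, 1) else 1
    # Bit 0: the subset holding the second operand selects left (0) or right (1).
    col_bit = 0 if second_subset in (0, 2) else 1
    # The two bits are computed independently and simply concatenated.
    return 2 * row_bit + col_bit


# Slide example: first operand in S1, second operand in S2.
print(f"executes on C{cluster_for(1, 2)}")
```

Because each bit depends on a single operand, neither decision has to wait for the other.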

24 4-cluster 8-way WSRS architecture: the register file
- Each individual physical register: 4 identical copies of (2-read, 3-write) registers
  - 8x smaller than conventional monolithic approach
  - 12.8x smaller than conventional distributed approach
- WSRS: 512 registers, 6.25 W, 3 cycles
- Conventional: 256 registers, (16 W, 5 cycles) or (14.5 W, 4 cycles)

25 4-cluster 8-way WSRS architecture: the wake-up logic
- The wake-up logic monitors all possible sources for each operand
- FUs from only two clusters are possible sources
  - Only 6 possible sources!
8-way WSRS architecture wake-up logic entry complexity = 4-way issue wake-up logic entry complexity

26 4-cluster 8-way WSRS architecture: bypass network
- Possible sources for each operand:
  - FUs from only two clusters are possible sources
- Bypass points: (pipeline length) x (possible FU sources) + register file
  - 8-way dist.: 4 cycles, 49 pos. op.
  - WSRS: 3 cycles, 19 pos. op.
  - 8-way mon.: 5 cycles, 61 pos. op.
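The three counts on slide 26 follow from the stated formula once a number of FU sources is fixed. In the sketch below, the 6 sources for WSRS come from the wake-up logic slide. The 12 sources for the two conventional designs are an assumption, inferred because that value reproduces the quoted 49 and 61. The function name is illustrative.

```python
def bypass_points(pipeline_length, fu_sources):
    """(pipeline length) x (possible FU sources) + 1 for the register file."""
    return pipeline_length * fu_sources + 1


designs = {
    "8-way distributed": (4, 12),  # 12 FU sources: inferred, not stated on the slide
    "WSRS": (3, 6),                # 6 sources: from the wake-up logic slide
    "8-way monolithic": (5, 12),   # 12 FU sources: inferred, not stated on the slide
}

for name, (cycles, sources) in designs.items():
    print(f"{name}: {bypass_points(cycles, sources)} possible operand sources")
```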

27 4-cluster WSRS architecture: fast-forwarding
- Local fast-forwarding inside a single cluster: 2 out of 4 consumers are reached on the next cycle
- Partial fast-forwarding inside a pair of adjacent clusters: 3 out of 4 consumers are reached on the next cycle!
- Complete fast-forwarding: consumer is close: may be possible to implement!
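The "2 out of 4" and "3 out of 4" figures can be reproduced by enumerating which (cluster, operand position) pairs are able to read a given register subset. The sketch assumes a layout that the transcript does not state explicitly: clusters arranged as C0 top-left, C1 top-right, C2 bottom-left, C3 bottom-right, cluster Ci writing subset Si, the first operand of a row fed by S0 and S1 (top) or S2 and S3 (bottom), and the second operand of a column fed by S0 and S2 (left) or S1 and S3 (right). Function names are hypothetical.

```python
def consumers_of(subset):
    """List the (cluster, operand position) pairs that can read `subset`."""
    readers = []
    for cluster in range(4):
        row, col = divmod(cluster, 2)
        if subset // 2 == row:  # first operand: S0, S1 feed the top row; S2, S3 the bottom row
            readers.append((cluster, "first"))
        if subset % 2 == col:   # second operand: S0, S2 feed the left column; S1, S3 the right
            readers.append((cluster, "second"))
    return readers


def reached_next_cycle(producer, forwarding_clusters):
    """Count consumers of cluster `producer`'s results that sit in `forwarding_clusters`."""
    # Assumption: cluster Ci writes subset Si, so the producer index is the subset index.
    return sum(1 for cluster, _ in consumers_of(producer) if cluster in forwarding_clusters)


total = len(consumers_of(0))
print(f"local only: {reached_next_cycle(0, {0})} of {total}")
print(f"adjacent pair: {reached_next_cycle(0, {0, 1})} of {total}")
```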

28 4-cluster WSRS architecture: nothing is entirely free!
- Strong constraint on allocation of instructions to clusters:
  - The cluster executing a dyadic instruction depends on the position of its operands in the register subsets.
- Degrees of freedom:
  - Monadic instructions can be executed on two clusters
  - One out of two commutative dyadic instructions can be executed on two clusters
  - Design clusters able to execute instructions in two forms? A-B and -B + A

29 Using monadic instructions for load balancing
[Figure: register subsets S0 to S3 and clusters C0 to C3; label: S0 or S1]

30 Commutativity for load balancing
[Figure: register subsets S0 to S3 and clusters C0 to C3; label: S0 op S2]

31 4-cluster WSRS architecture: nothing comes for free (2)
- Extra free lists and associated logic
- Extra pipeline stage(s):
  - Instructions must be allocated to clusters before the last step in register renaming: + 3 cycles
  - But shorter register access time: - 2 cycles

32 Performance issues on 4-way WSRS architectures
- Workload may be unbalanced among the clusters:
  - Use of the degrees of freedom: monadic instructions, « commutative » clusters
- Higher probability of local consumption of a register
Naive allocation policies on WSRS compete favorably with naive policies on conventional architectures

33 Summary
- Register Write Specialization:
  - Limiting the number of write ports on each physical register
  - Naturally leads to using a distributed register file
  - Mastering power consumption, silicon area and access time
- But: some extra complexity in register renaming

34 Summary (2)
- Register Write Specialization + Register Read Specialization:
  - Further limits the number of ports on each physical register
  - Mastering power consumption, silicon area and access time
  - Side effects: mastering wake-up logic and bypass network complexity
- But: constrains instruction allocation to clusters

35 Future work
- Intelligent instruction allocation policies
- Exploration of other possible interconnections
- Use of heterogeneous clusters
- SMT mode