1 Register Write Specialization Register Read Specialization A path to complexity effective wide-issue superscalar processors André Seznec, Eric Toullec,


1 Register Write Specialization, Register Read Specialization: A path to complexity-effective wide-issue superscalar processors
André Seznec, Eric Toullec, Olivier Rochecouste
IRISA/INRIA

2 Doubling the issue width
- Functional units:
  - Silicon area: 2x
  - Power consumption: 2x
  - Same latency
- Register file:
  - Silicon area: > 7x
  - Power consumption: > 4x
  - Access time: 1.5x
- Wake-up logic entries:
  - Each entry monitors twice as many inputs
  - Affects area, power consumption, and response time
- Bypass network:
  - Wider multiplexers: > 2x
  - Longer communications
AS-ET-OR, Caps Team, Irisa

3 An unwritten rule applied in all superscalar processor designs
- For general-purpose registers: any physical register can be the source or the result of any instruction executed on any functional unit

4 Distributed register file
[Figure: clusters C0, C1, C3, C2]
- Local register file: shorter read access time but larger silicon area

5 The register file issue

6 Silicon area for the physical register file

7 8-way against 4-way (100 nm, 5 GHz)

  Register file       Copies              Power           Access time     Area
  4-way distributed   2 identical copies  3.1 W           3 cycles        128 x 320 w^2 x W
  8-way distributed   4 identical copies  14.5 W (x 4.5)  4 cycles (+1)   256 x 1792 w^2 x W (x 11.2)
  8-way monolithic    (not stated)        16 W (x 5)      5 cycles (+2)   256 x 1120 w^2 x W (x 7)

8 Let us reduce the number of ports on each individual register

9 Register Write Specialization
[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3]

10 Distributed Register File and Register Write Specialization
[Figure: clusters C0, C1, C3, C2]

11 Register Write Specialization
- Each cluster writes only a subset of the registers
- Fewer write ports on every individual physical register
- 4-cluster 8-way distributed register file, 512 entries:
  - 280 x w^2 per register bit: 1/2 or 1/3 of conventional
  - 3-cycle access time: saves 1 or 2 cycles
  - 8.5 W against 14.5 W or 16 W
- But the allocation of instructions to clusters must precede register renaming
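The "1/2 or 1/3 of conventional" figure seems to refer to total register file area rather than per-bit area. The following sketch checks that reading, assuming total area scales as (number of entries) x (per-bit area) and reusing the 8-way figures from slide 7; the names `CONFIGS`, `total_area`, and `area_ratio` are illustrative only.

```python
# Entry counts and per-bit areas (in units of w^2) as quoted on the slides.
CONFIGS = {
    "write_specialized": (512, 280),   # slide 11
    "8way_distributed": (256, 1792),   # slide 7
    "8way_monolithic": (256, 1120),    # slide 7
}


def total_area(entries, per_bit_area):
    """Area of one bit column in w^2; the word width W cancels in ratios."""
    return entries * per_bit_area


def area_ratio(name, baseline):
    """Total area of `name` relative to `baseline` (assumed linear model)."""
    return total_area(*CONFIGS[name]) / total_area(*CONFIGS[baseline])


print(area_ratio("write_specialized", "8way_monolithic"))   # 0.5
print(area_ratio("write_specialized", "8way_distributed"))  # 0.3125
```

Under this assumption the write-specialized file, despite holding twice as many registers, takes half the area of the monolithic file and a little under a third of the distributed one.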

12 Register Write Specialization and Register Renaming
Instructions to rename:
  1: Op R6, R7 -> R5
  2: Op R2, R5 -> R6
  3: Op R6, R3 -> R4
  4: Op R4, R6 -> R2
Inputs: 4 free odd registers, 4 free even registers, 4-bit subset target vector
After resolving dependencies inside the group:
  1: Op L6, L7 -> res1
  2: Op L2, res1 -> res2
  3: Op res2, L3 -> res3
  4: Op res3, res2 -> res4
4 new free registers + old map table:
  1: Op P6, P7 -> RES1
  2: Op P2, RES1 -> RES2
  3: Op RES2, P3 -> RES3
  4: Op RES3, RES2 -> RES4
Output: new map table
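One possible reading of this slide, sketched in Python: the renamer speculatively takes one free register per instruction from every subset, and the subset target vector selects which of them each instruction actually uses. The function name `rename_group`, the data layout, and the physical register numbers are assumptions made for illustration, not taken from the slides.

```python
from collections import deque


def rename_group(group, targets, free_lists, map_table):
    """Rename one group of instructions under register write specialization.

    group      : list of (src1, src2, dst) logical register numbers
    targets    : subset index chosen for each instruction's result
    free_lists : one deque of free physical registers per subset
    map_table  : dict logical -> physical, updated in place
    Returns (renamed, unused): the renamed instructions and the speculatively
    taken registers that no instruction consumed.
    """
    n = len(group)
    # Take n free registers from every subset before the targets are applied.
    picks = [[fl.popleft() for _ in range(n)] for fl in free_lists]
    renamed, unused = [], []
    for i, ((src1, src2, dst), target) in enumerate(zip(group, targets)):
        # Sources read the mapping as updated by earlier instructions.
        p1, p2 = map_table[src1], map_table[src2]
        pd = picks[target][i]
        map_table[dst] = pd
        renamed.append((p1, p2, pd))
        unused.extend(picks[s][i] for s in range(len(free_lists)) if s != target)
    return renamed, unused


# The four instructions from the slide; logical Rn starts mapped to physical n.
group = [(6, 7, 5), (2, 5, 6), (6, 3, 4), (4, 6, 2)]
free_lists = [deque([8, 10, 12, 14]), deque([9, 11, 13, 15])]  # even, odd
map_table = {r: r for r in range(8)}
renamed, unused = rename_group(group, [0, 1, 0, 1], free_lists, map_table)
print(renamed)  # [(6, 7, 8), (2, 8, 11), (11, 3, 12), (12, 11, 15)]
print(unused)   # [9, 10, 13, 14]
```

Half of the registers taken from the free lists end up unused in this run, which is the cost that the recycling step addresses.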

13 Register Write Specialization and Register Renaming (2)
- Consumes a lot of registers: need for recycling
  1. Build two lists of registers to be recycled
  2. Pack both lists
  3. Concatenate the two lists
  4. Append to the free list
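The four recycling steps can be sketched as follows. This is a minimal illustration that assumes two register subsets, a per-instruction list of speculatively taken registers for each subset (`picks`), a vector `targets` naming the subset each instruction actually used, and a single recycle queue `free_list`; all of these names and the single-queue layout are assumptions.

```python
def recycle_unused(picks, targets, free_list):
    """Return unused speculatively allocated registers to the free list.

    picks     : picks[s][i] is the register taken from subset s for slot i
    targets   : targets[i] is the subset that instruction i really used
    free_list : list receiving the recycled registers (assumed layout)
    """
    # Step 1: one list per subset, with holes (None) where a register was used.
    to_recycle = [
        [None if targets[i] == s else reg for i, reg in enumerate(regs)]
        for s, regs in enumerate(picks)
    ]
    # Step 2: pack each list by squeezing out the holes.
    packed = [[reg for reg in regs if reg is not None] for regs in to_recycle]
    # Step 3: concatenate the packed lists.
    merged = [reg for regs in packed for reg in regs]
    # Step 4: append the result to the free list.
    free_list.extend(merged)
    return merged


free_list = [16, 17]
recycled = recycle_unused([[8, 10, 12, 14], [9, 11, 13, 15]], [0, 1, 0, 1], free_list)
print(recycled)   # [10, 14, 9, 13]
print(free_list)  # [16, 17, 10, 14, 9, 13]
```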

14 Register Write Specialization and Register Renaming (3)
- An alternative:
  - Compute the number of registers needed in each register subset
  - Pick the right number of registers from each of the free lists
  - No need for recycling registers
- Think about round-robin distribution!
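A minimal sketch of this alternative, assuming one free list per subset and a known target subset for each instruction in the group. The helper names `round_robin_targets` and `allocate_exact`, and the stall-by-returning-`None` convention, are hypothetical.

```python
from collections import Counter, deque


def round_robin_targets(n_instructions, n_subsets, start=0):
    """Assign result subsets in round-robin order, so counts are known up front."""
    return [(start + i) % n_subsets for i in range(n_instructions)]


def allocate_exact(targets, free_lists):
    """Take exactly the registers each subset needs; nothing is left to recycle.

    Returns one physical register per instruction, or None (a stall in this
    sketch) when some subset does not hold enough free registers.
    """
    needed = Counter(targets)
    if any(len(free_lists[s]) < count for s, count in needed.items()):
        return None
    return [free_lists[t].popleft() for t in targets]


free_lists = [deque([8, 10, 12, 14]), deque([9, 11, 13, 15])]
targets = round_robin_targets(4, 2)
print(targets)                              # [0, 1, 0, 1]
print(allocate_exact(targets, free_lists))  # [8, 9, 10, 11]
print([list(fl) for fl in free_lists])      # [[12, 14], [13, 15]]
```

With round-robin distribution the per-subset counts do not depend on the instructions themselves, which is presumably why the slide singles it out.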

15 Performance issues
- Register Write Specialization only:
  - Round-robin allocation: no extra stage for register renaming
  - Shorter register access time
  - Overall shorter pipeline: slightly better performance

16 Register Read Specialization
[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1]

17 Register Read Specialization
- Limits the number of read ports on each individual register
- Puts strong constraints on the allocation of instructions to clusters
- Caution:
  - Personal opinion: don't use it alone!
  - The interconnection topology must ensure that every instruction is executable

18 WSRS architectures
- Combining Register Read Specialization and Register Write Specialization

19 4-cluster WSRS architecture
[Figure: register subsets S0, S1, S2, S3 and clusters C0, C1, C2, C3]
- The positions of an instruction's operands determine the execution cluster

20 4-cluster WSRS architecture: allocating instructions to clusters
[Figure: register subsets S0, S1, S2, S3 and clusters C0, C1, C2, C3]
Op: R6, R7 -> R5 with S1, S2 -> S0
- The first operand determines the top or bottom bicluster
- The second operand determines the left or right bicluster

21 4-cluster WSRS architecture: allocating instructions to clusters (2)
Op: R6, R7 -> R5 with S1, S2 -> S0
- The computations of the two bits are independent :-)
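The two-bit cluster selection can be sketched as below. The exact wiring of subsets to clusters is not recoverable from the transcript, so the mapping here is an assumption: clusters are laid out as a 2 x 2 grid (C0, C1 on top; C2, C3 at the bottom), the first operand's subset picks the row, and the second operand's subset picks the column. It was chosen because it reproduces the slide's example, where operands in S1 and S2 lead to a result in S0.

```python
def row_bit(first_subset):
    """Assumed wiring: S0, S1 feed the top bicluster (0); S2, S3 the bottom (1)."""
    return first_subset >> 1


def col_bit(second_subset):
    """Assumed wiring: S0, S2 feed the left bicluster (0); S1, S3 the right (1)."""
    return second_subset & 1


def execution_cluster(first_subset, second_subset):
    """Each bit depends on one operand only, so both can be computed in parallel."""
    return (row_bit(first_subset) << 1) | col_bit(second_subset)


# Slide example: operands in S1 and S2 execute on C0, which writes subset S0.
print(execution_cluster(1, 2))  # 0
```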

22 4-cluster 8-way WSRS architecture: the register file
- Each individual physical register: 2 identical copies of (4-read, 3-write) registers
  - 8x smaller than the conventional monolithic approach
  - 12.8x smaller than the conventional distributed approach
- WSRS: 512 registers, 6.25 W, 3 cycles
- Conventional: 256 registers, (16 W, 5 cycles) or (14.5 W, 4 cycles)

23 4-cluster 8-way WSRS architecture: the wake-up logic
- The wake-up logic monitors all possible sources for each operand
  - Only FUs from two clusters are possible sources
- 8-way WSRS wake-up logic entry complexity = 4-way issue wake-up logic entry complexity
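To illustrate why only two clusters can produce each operand, the sketch below assumes a 2 x 2 grid of clusters in which cluster Ci writes subset Si, the first operand is read from the two subsets of the cluster's row, and the second operand from the two subsets of its column. This wiring is an assumption; the slides only state the two-cluster property.

```python
def producer_clusters(cluster):
    """Clusters whose results can feed each operand of `cluster`.

    Because cluster Ci is the only writer of subset Si, the producers of an
    operand are exactly the subsets that operand port is allowed to read.
    """
    row, col = cluster >> 1, cluster & 1
    first_operand = {s for s in range(4) if s >> 1 == row}
    second_operand = {s for s in range(4) if s & 1 == col}
    return first_operand, second_operand


for c in range(4):
    print(c, producer_clusters(c))
```

Every operand port ends up with two producer clusters out of four, which matches the slide's comparison with a 4-way wake-up entry.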

24 4-cluster 8-way WSRS architecture: bypass network
- Possible sources for each operand
  - Only FUs from two clusters are possible sources
- Bypass points: (pipeline length) x (possible FU sources) + register file

  Architecture        Pipeline length  Possible operand sources
  8-way distributed   4 cycles         49
  8-way monolithic    5 cycles         61
  WSRS (= 4-way)      3 cycles         19
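The three counts follow from the slide's formula if one assumes 12 functional-unit sources for the conventional 8-way designs and 6 for WSRS (two clusters out of four). Those two FU counts are inferred from the totals, not stated on the slide, and `bypass_sources` is an illustrative name.

```python
def bypass_sources(pipeline_length, fu_sources):
    """Possible sources per operand: one bypass point per FU source for each
    pipeline stage, plus the register file itself."""
    return pipeline_length * fu_sources + 1


print(bypass_sources(4, 12))  # 8-way distributed: 49
print(bypass_sources(5, 12))  # 8-way monolithic: 61
print(bypass_sources(3, 6))   # WSRS: 19
```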

25 4-cluster WSRS architecture: nothing is entirely free!
- Strong constraint on the allocation of instructions to clusters:
  - The cluster executing a dyadic instruction depends on the position of its operands in the register subsets
- Degrees of freedom:
  - Monadic instructions can be executed on two clusters
  - Three out of four commutative dyadic instructions can be executed on two distinct clusters
  - Design clusters able to execute instructions in two forms? A - B and -B + A
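The "three out of four" figure can be checked by enumeration. The sketch assumes the first operand's subset selects the row of a 2 x 2 cluster grid and the second operand's subset selects the column, and that operand pairs are spread uniformly over the 16 subset combinations; both the wiring and the uniformity are assumptions. For monadic instructions it further assumes the single operand uses the first-operand port.

```python
from itertools import product


def execution_cluster(first_subset, second_subset):
    """Assumed wiring: row from the first operand, column from the second."""
    return ((first_subset >> 1) << 1) | (second_subset & 1)


def commutative_choices(a, b):
    """Clusters reachable when the two operands may be swapped."""
    return {execution_cluster(a, b), execution_cluster(b, a)}


def monadic_choices(subset):
    """One operand fixes the row only, so the column bit stays free."""
    row = subset >> 1
    return {(row << 1) | col for col in (0, 1)}


pairs = list(product(range(4), repeat=2))
flexible = [p for p in pairs if len(commutative_choices(*p)) == 2]
print(len(flexible), "of", len(pairs))  # 12 of 16
print(sorted(monadic_choices(1)))       # [0, 1]
```

Under this wiring, swapping the operands changes the cluster unless both operands sit in the same subset, which accounts for the remaining 4 of 16 cases.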

26 Using monadic instructions for load balancing
[Figure: register subsets S0, S1, S2, S3 and clusters C0, C1, C2, C3; labels: S2, "S0 or S1"]

27 4-cluster WSRS architecture: nothing comes for free (2)
- Extra free lists and associated logic
- Extra pipeline stage(s):
  - Instructions must be allocated to clusters before the last step in register renaming: + 3 cycles
  - But shorter register access time: - 1 or 2 cycles

28 Performance issues on 4-cluster 8-way architectures
- Workload may be unbalanced among the clusters:
  - Use of the degrees of freedom: monadic instructions, "commutative" clusters
- Higher probability of local consumption of a register
- Naive allocation policies on WSRS compete with naive policies on a conventional architecture

29 Orthogonal to most previous work
- Just apply previous proposals at the cluster level

30 Summary
- Register Write Specialization:
  - Limits power consumption, silicon area, and access time
  - Does not impair performance
- But: some extra complexity in register renaming

31 Summary (2)
- Register Write Specialization + Register Read Specialization:
  - Further limits power consumption, silicon area, and access time of the register file
  - Limits wake-up logic and bypass network complexity
- But: constrains the allocation of instructions to clusters

32 Future work
- Intelligent instruction allocation policies
- Exploration of other possible interconnections
- SMT mode

