Published by Estella Hancock. Modified over 9 years ago.
1
Register Write Specialization, Register Read Specialization: a path to complexity-effective wide-issue superscalar processors
André Seznec, Eric Toullec, Olivier Rochecouste
IRISA/INRIA
2
AS-ET-OR Caps Team Irisa
Why design wide-issue superscalar processors?
SMT superscalar processors!
3
Doubling the issue width
Functional units:
  Silicon area: 2x
  Power consumption: 2x
  Same latency
Register file:
  Silicon area: > 8x
  Power consumption: > 4x
  Access time: 1.5x
Wake-up logic entries:
  Each entry monitors twice as many inputs
  Affects area, consumption, response time
Bypass network:
  Wider multiplexors: > 2x
  Longer communications
4
An unwritten rule applied to all superscalar processor designs
For general-purpose registers:
  Any physical register can be the source or the result of any instruction executed on any functional unit
5
The register file issue
6
Silicon area for the physical register file
7
Conventional clustered design
[Figure: clusters C0, C1, C2, C3 and the register file]
8
Distributed register file
[Figure: clusters C0, C1, C2, C3]
Local register file: shorter read access time, but larger silicon area
9
8-way against 4-way (100 nm, 5 GHz)
Register file        Copies  Power          Access time    Area
4-way distributed    2       3.1 W          3 cycles       128 x 320 w^2 x W
8-way monolithic     -       16 W (x5)      5 cycles (+2)  256 x 1120 w^2 x W (x8)
8-way distributed    4       14.5 W (x4.5)  4 cycles (+1)  256 x 1792 w^2 x W (x11)
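The per-bit area figures on this slide (320, 1120 and 1792 w^2) can be reproduced with a simple port-count model. The sketch below assumes that one register bit with R read ports and W write ports occupies (R + W) x (R + 2W) w^2, multiplied by the number of copies. Neither this formula nor the port counts used here appear in the slides; both are assumptions chosen because they match the three figures above, and the model is only checked against those three figures.

```python
def per_bit_area(read_ports, write_ports, copies=1):
    """Area of one register bit in w^2 units under the ASSUMED model
    (R + W) * (R + 2W) per copy. The model is not stated in the slides."""
    r, w = read_ports, write_ports
    return copies * (r + w) * (r + 2 * w)


# Port counts are assumptions (not slide content), picked to match the slide.
designs = {
    "4-way distributed": per_bit_area(read_ports=4, write_ports=6, copies=2),
    "8-way monolithic": per_bit_area(read_ports=16, write_ports=12),
    "8-way distributed": per_bit_area(read_ports=4, write_ports=12, copies=4),
}

for name, area in designs.items():
    print(f"{name}: {area} w^2 per register bit")
```

Under these assumptions, replicating the file cuts the read ports per copy but leaves every copy with all the write ports, which is why the distributed 8-way file ends up larger than the monolithic one.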
10
Let us reduce the number of ports on each individual register
11
Register Write Specialization
[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3]
12
Distributed Register File and Register Write Specialization
[Figure: clusters C0, C1, C2, C3]
13
Register Write Specialization
Each cluster writes only a subset of the registers
Fewer write ports on every individual physical register
But allocation to clusters must precede register renaming
4-cluster 8-way distributed register file:
  512 entries
  320 w^2 per register bit
  3 cycles access time
  8.5 W
14
Register Write Specialization and Register Renaming
1: Op R6, R7 -> R5
2: Op R2, R5 -> R6
3: Op R6, R3 -> R4
4: Op R4, R6 -> R2
[Diagram labels: 4 free odd registers; 4 free even registers; 4-bit subset target vector]
1: Op L6, L7 -> res1
2: Op L2, res1 -> res2
3: Op res2, L3 -> res3
4: Op res3, res2 -> res4
[Diagram labels: 4 new free registers + old map table]
1: Op P6, P7 -> RES1
2: Op P2, RES1 -> RES2
3: Op RES2, P3 -> RES3
4: Op RES3, RES2 -> RES4
[Diagram label: new map table]
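The renaming example above can be sketched in a few lines. This is a minimal illustration, not the authors' hardware scheme. It assumes two register subsets (even and odd physical registers), one free list per subset, and that instruction i in the group takes the i-th register drawn from its target subset. The function name `rename_group`, the register numbers and the target vector are all hypothetical.

```python
from collections import deque


def rename_group(instrs, targets, map_table, free_lists):
    """Rename a group of (src1, src2, dst) instructions.

    targets[i] is the register subset that instruction i must write
    (the "subset target vector"). One register per instruction is drawn
    from EVERY subset; the ones not selected are returned for recycling.
    """
    n = len(instrs)
    drawn = [[fl.popleft() for _ in range(n)] for fl in free_lists]
    renamed = []
    unused = [[] for _ in free_lists]
    for i, ((src1, src2, dst), target) in enumerate(zip(instrs, targets)):
        # Sources are read before the destination is remapped, so
        # dependencies inside the group see the new physical register.
        phys1, phys2 = map_table[src1], map_table[src2]
        for subset, regs in enumerate(drawn):
            if subset != target:
                unused[subset].append(regs[i])
        map_table[dst] = drawn[target][i]
        renamed.append((phys1, phys2, drawn[target][i]))
    return renamed, unused


# Hypothetical state: logical Rn initially maps to physical n.
map_table = {r: r for r in range(8)}
free_lists = [deque([8, 10, 12, 14]), deque([9, 11, 13, 15])]  # even, odd
instrs = [(6, 7, 5), (2, 5, 6), (6, 3, 4), (4, 6, 2)]  # the four Ops above
targets = [0, 1, 1, 0]  # assumed subset target vector

renamed, unused = rename_group(instrs, targets, map_table, free_lists)
print(renamed)
print(unused)
```

Eight registers are drawn for four instructions, so half of them come back in `unused`. That is the register consumption the next slide addresses with recycling.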
15
Register Write Specialization and Register Renaming (2)
Consumes a lot of registers: need for recycling
1: build two lists of registers to be recycled
2: pack both lists
3: concatenate the two lists
4: append to the free list
16
Register Write Specialization and Register Renaming (3)
An alternative:
  Compute the number of registers needed in each register subset
  Pick the right number of registers from each of the free lists
  No need for recycling registers
Think about round-robin distribution!
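A minimal sketch of this alternative, assuming one free list per register subset: count how many destinations fall in each subset, pop exactly that many registers from each free list, and hand them out in program order. The names `allocate_exact` and `round_robin_targets` and the register numbering are hypothetical.

```python
from collections import Counter, deque


def round_robin_targets(num_instrs, num_subsets, start=0):
    """Assign instructions to subsets in round-robin order."""
    return [(start + i) % num_subsets for i in range(num_instrs)]


def allocate_exact(targets, free_lists):
    """Take from each free list only as many registers as are needed."""
    counts = Counter(targets)
    picked = {
        subset: deque(free_lists[subset].popleft() for _ in range(count))
        for subset, count in counts.items()
    }
    # Distribute the picked registers to instructions in program order.
    return [picked[target].popleft() for target in targets]


# Hypothetical numbering: subset s owns physical registers 8+s, 12+s, ...
free_lists = [deque(range(8 + s, 40, 4)) for s in range(4)]
targets = round_robin_targets(num_instrs=8, num_subsets=4)
dests = allocate_exact(targets, free_lists)
print(targets)
print(dests)
```

With round-robin distribution the per-subset counts are known in advance (two per subset for eight instructions and four subsets), so no counting step is needed at run time.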
17
Performance issues
Register Write Specialization only:
  Round-robin allocation: no extra stage for register renaming
  Shorter register access time
  Overall shorter pipeline: slightly better performance
18
Register Read Specialization
[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1]
19
Register Read Specialization
Limits the number of read ports on each individual register
Puts strong constraints on the allocation of instructions to clusters
Caution (personal opinion): don't use it alone!
Interconnection topology must ensure that every instruction is executable
20
WSRS architectures
Combining Register Read Specialization and Register Write Specialization
21
4-cluster WSRS architecture
[Figure: register subsets S0, S1, S2, S3 and clusters C0, C1, C2, C3]
The positions of an instruction's operands determine the execution cluster
22
4-cluster WSRS architecture: allocating instructions to clusters
[Figure: register subsets S0, S1, S2, S3 and clusters C0, C1, C2, C3]
Op: R6, R7 -> R5
    S1, S2 -> S0
The first operand determines the top or bottom bicluster
The second operand determines the left or right bicluster
23
4-cluster WSRS architecture: allocating instructions to clusters (2)
Op: R6, R7 -> R5
    S1, S2 -> S0
The computations of the two bits are independent :-)
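The two-bit cluster selection can be sketched as two independent table lookups. The slides do not give the actual subset-to-bicluster mapping, so the layout below is a hypothetical one: clusters in a 2x2 grid (C0 C1 on top, C2 C3 on the bottom), first operands in S0 or S1 selecting the top row, second operands in S0 or S2 selecting the left column. It also assumes that cluster Ci writes subset Si. With these assumptions the example above (operands in S1 and S2, result in S0) lands on C0.

```python
# HYPOTHETICAL mapping, not taken from the slides.
# Grid layout assumed:  top row: C0 C1   bottom row: C2 C3
FIRST_OPERAND_ROW = {0: 0, 1: 0, 2: 1, 3: 1}   # S0, S1 -> top; S2, S3 -> bottom
SECOND_OPERAND_COL = {0: 0, 2: 0, 1: 1, 3: 1}  # S0, S2 -> left; S1, S3 -> right


def execution_cluster(first_subset, second_subset):
    """Return the cluster index for a dyadic instruction.

    Each bit depends on one operand only, so the two lookups can be
    done independently (and in parallel in hardware).
    """
    row = FIRST_OPERAND_ROW[first_subset]
    col = SECOND_OPERAND_COL[second_subset]
    return 2 * row + col


# Slide example: operands in S1 and S2.
cluster = execution_cluster(first_subset=1, second_subset=2)
print(f"C{cluster}")
```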
24
4-cluster 8-way WSRS architecture: the register file
Each individual physical register: 4 identical copies of (2-read, 3-write) registers
  8x smaller than the conventional monolithic approach
  12.8x smaller than the conventional distributed approach
WSRS: 512 registers, 6.25 W, 3 cycles
Conventional: 256 registers, (16 W, 5 cycles) or (14.5 W, 4 cycles)
25
4-cluster 8-way WSRS architecture: the wake-up logic
The wake-up logic monitors all possible sources for each operand
FUs from only two clusters are possible sources: only 6 possible sources!
8-way WSRS wake-up logic entry complexity = 4-way issue wake-up logic entry complexity
26
4-cluster 8-way WSRS architecture: bypass network
Possible sources for each operand: FUs from only two clusters are possible sources
Bypass points: (pipeline length) x (possible FU sources) + register file
  8-way monolithic:   5 cycles, 61 pos. op.
  8-way distributed:  4 cycles, 49 pos. op.
  WSRS:               3 cycles, 19 pos. op.
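The three totals follow from the formula on this slide once the number of possible FU sources is fixed. The sketch below uses 6 sources for WSRS, as stated on the wake-up logic slide, and assumes 12 sources for both conventional 8-way designs. The value 12 is not given in the slides; it is inferred here because it is the only value that reproduces 49 and 61. The register file is counted as one extra source.

```python
def bypass_points(pipeline_cycles, fu_sources):
    """(pipeline length) x (possible FU sources) + 1 for the register file."""
    return pipeline_cycles * fu_sources + 1


# (register access cycles, possible FU sources per operand)
# The value 12 for the conventional designs is an inferred assumption.
configs = {
    "8-way monolithic": (5, 12),
    "8-way distributed": (4, 12),
    "WSRS": (3, 6),
}

points = {name: bypass_points(*cfg) for name, cfg in configs.items()}
for name, count in points.items():
    print(f"{name}: {count}")
```

WSRS gains on both factors at once: fewer pipeline stages to bypass and half as many functional units able to produce each operand.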
27
4-cluster WSRS architecture: fast-forwarding
Local fast-forwarding inside a single cluster: 2 out of 4 consumers are reached on the next cycle
Partial fast-forwarding inside a pair of adjacent clusters: 3 out of 4 consumers are reached on the next cycle!
Complete fast-forwarding: the consumer is close, so it may be possible to implement!
28
4-cluster WSRS architecture: nothing is entirely free!
Strong constraint on allocation of instructions to clusters:
  The cluster executing a dyadic instruction depends on the position of its operands in the register subsets.
Degrees of freedom:
  Monadic instructions can be executed on two clusters
  One out of two commutative dyadic instructions can be executed on two clusters
  Design clusters able to execute instructions in two forms? A - B and -B + A
29
Using monadic instructions for load balancing
[Figure: register subsets S0, S1, S2, S3 and clusters C0, C1, C2, C3; labels: S2, S0 or S1]
30
Commutativity for load balancing
[Figure: register subsets S0, S1, S2, S3 and clusters C0, C1, C2, C3; labels: S2, S0 op S2]
31
4-cluster WSRS architecture: nothing comes for free (2)
Extra free lists and associated logic
Extra pipeline stage(s):
  Instructions must be allocated to clusters before the last step in register renaming: + 3 cycles
  But shorter register access time: - 2 cycles
32
Performance issues on 4-way WSRS architectures
Workload may be unbalanced among the clusters:
  Use of the degrees of freedom: monadic instructions, « commutative » clusters
Higher probability of local consumption of a register
Naive allocation policies on WSRS compete favorably with naive policies on conventional architectures
33
Summary
Register Write Specialization:
  Limits the number of write ports on each physical register
  Leads naturally to the use of a distributed register file
  Masters power consumption, silicon area and access time
But: some extra complexity in register renaming
34
Summary (2)
Register Write Specialization + Register Read Specialization:
  Further limits the number of ports on each physical register
  Masters power consumption, silicon area and access time
  Side effects: masters wake-up logic and bypass network complexity
But: constrains instruction allocation to clusters
35
Future work
Intelligent instruction allocation policies
Exploration of other possible interconnections
Use of heterogeneous clusters
SMT mode