Published by Kristina Lawrence. Modified over 9 years ago.
1
Register Write Specialization, Register Read Specialization: a path to complexity-effective wide-issue superscalar processors
André Seznec, Eric Toullec, Olivier Rochecouste
IRISA/INRIA
2
AS-ET-OR Caps Team Irisa
Doubling the issue width
Functional units:
  Silicon area: 2x
  Power consumption: 2x
  Same latency
Register file:
  Silicon area: > 7x
  Power consumption: > 4x
  Access time: 1.5x
Wake-up logic entries:
  Each entry monitors twice as many inputs (area, consumption, response time)
Bypass network:
  Wider multiplexors
  > 2x longer communications
3
An unwritten rule applied in all superscalar processor designs
For general-purpose registers: any physical register can be the source or the result of any instruction executed on any functional unit.
4
Distributed register file
[Figure: four clusters C0, C1, C2, C3]
Local register file: shorter read access time but larger silicon area
5
The register file issue
6
Silicon area for the physical register file
7
8-way against 4-way (100 nm, 5 GHz)
Register file        Copies  Power          Access time    Silicon area
4-way distributed    2       3.1 W          3 cycles       128 x 320 w^2 x W
8-way distributed    4       14.5 W (x4.5)  4 cycles (+1)  256 x 1792 w^2 x W (x11.2)
8-way monolithic     -       16 W (x5)      5 cycles (+2)  256 x 1120 w^2 x W (x7)
8
Let us reduce the number of ports on each individual register
9
Register Write Specialization
[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3]
10
Distributed Register File and Register Write Specialization
[Figure: clusters C0, C1, C2, C3]
11
Register Write Specialization
Each cluster writes only a subset of the registers
Fewer write ports on every individual physical register
4-cluster 8-way distributed register file:
  512 entries
  280 x w^2 per register bit: 1/2 or 1/3 of conventional
  3-cycle access time: saves 1 or 2 cycles
  8.5 W against 14.5 W or 16 W
But allocation must precede register renaming
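The "1/2 or 1/3 of conventional" figure can be checked with a little arithmetic. The sketch below assumes that the slide numbers read as (number of registers) x (area per register bit, in w^2), and that the comparison is on total area for equal register width W; both readings are assumptions, not statements from the slides.

```python
# Figures quoted on the slides: (number of registers, area per register bit in w^2).
CONVENTIONAL = {
    "8-way monolithic": (256, 1120),
    "8-way distributed": (256, 1792),
}
WRITE_SPECIALIZED = (512, 280)


def total_area(entries, area_per_bit):
    """Total area per bit of width: number of registers times per-bit cell area."""
    return entries * area_per_bit


def area_ratios():
    """Write-specialized area relative to each conventional 8-way organization."""
    ws = total_area(*WRITE_SPECIALIZED)
    return {name: ws / total_area(*cfg) for name, cfg in CONVENTIONAL.items()}


for name, ratio in area_ratios().items():
    print(f"{name}: {ratio:.4f}")
```

Under these assumptions the ratio comes out at one half against the monolithic file and a little under one third against the distributed one, even though the write-specialized file holds twice as many registers.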
12
Register Write Specialization and Register Renaming
Instruction group (logical registers):
  1: Op R6, R7 -> R5
  2: Op R2, R5 -> R6
  3: Op R6, R3 -> R4
  4: Op R4, R6 -> R2
Inputs: 4 free odd registers, 4 free even registers, 4-bit subset target vector
Intermediate form:
  1: Op L6, L7 -> res1
  2: Op L2, res1 -> res2
  3: Op res2, L3 -> res3
  4: Op res3, res2 -> res4
4 new free registers + old map table:
  1: Op P6, P7 -> RES1
  2: Op P2, RES1 -> RES2
  3: Op RES2, P3 -> RES3
  4: Op RES3, RES2 -> RES4
Output: new map table
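A minimal sketch of this renaming step, to make the data flow concrete. It is not the authors' hardware design: the function name `rename_group`, the register numbers, and the target vector used in the example are all hypothetical. It assumes two register subsets (even and odd), takes four free registers from each free list as on the slide, and renames the four-instruction group in program order so that later instructions see earlier results.

```python
from collections import deque


def rename_group(instrs, targets, free_lists, map_table):
    """Rename one group of instructions under register write specialization.

    instrs:     list of (src1, src2, dst) logical register numbers
    targets:    subset index of each result (the "subset target vector")
    free_lists: one deque of free physical registers per subset
    map_table:  dict logical -> physical, updated in place
    Returns the renamed instructions and the picked-but-unused registers.
    """
    n = len(instrs)
    # Worst case: all results go to one subset, so take n registers from each list.
    picked = [[fl.popleft() for _ in range(n)] for fl in free_lists]
    used = [0] * len(free_lists)
    renamed = []
    for (s1, s2, dst), t in zip(instrs, targets):
        # Sources are read before the destination is remapped, so dependencies
        # on earlier instructions of the same group resolve to their new registers.
        p1, p2 = map_table[s1], map_table[s2]
        pd = picked[t][used[t]]
        used[t] += 1
        map_table[dst] = pd
        renamed.append((p1, p2, pd))
    unused = [regs[k:] for regs, k in zip(picked, used)]
    return renamed, unused


# The four instructions from the slide; subset 0 = even, subset 1 = odd (assumed).
group = [(6, 7, 5), (2, 5, 6), (6, 3, 4), (4, 6, 2)]
targets = [0, 1, 1, 0]                      # hypothetical target vector
free = [deque([32, 34, 36, 38]), deque([33, 35, 37, 39])]
table = {r: r for r in range(8)}            # hypothetical initial mapping
renamed, unused = rename_group(group, targets, free, table)
print(renamed)
print(unused)
```

The registers left in `unused` are the ones that were picked but not consumed, which is why this scheme needs a recycling step.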
13
Register Write Specialization and Register Renaming (2)
Consumes a lot of registers: need for recycling
  1. Build two lists of registers to be recycled
  2. Pack both lists
  3. Concatenate the two lists
  4. Append to the free list
14
Register Write Specialization and Register Renaming (3)
An alternative:
  Compute the number of registers needed in each register subset
  Pick the right number of registers from each of the free lists
  No need for recycling registers
Think about round-robin distribution!
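The alternative can be sketched as follows. The helper names `pick_exact` and `round_robin_targets` and the register numbering are hypothetical. The round-robin remark is read here as: when results are distributed to subsets in round-robin order, the count needed from each subset is known in advance. That reading is an interpretation of the slide, not something it states.

```python
from collections import Counter, deque


def pick_exact(targets, free_lists):
    """Take from each free list exactly as many registers as the group needs."""
    need = Counter(targets)                  # registers needed per subset
    picked = {s: [free_lists[s].popleft() for _ in range(k)]
              for s, k in need.items()}
    cursors = {s: iter(regs) for s, regs in picked.items()}
    # Hand the registers out in program order; nothing is left over to recycle.
    return [next(cursors[t]) for t in targets]


def round_robin_targets(n_instr, n_subsets, start=0):
    """Round-robin distribution: instruction i writes subset (start + i) mod n_subsets."""
    return [(start + i) % n_subsets for i in range(n_instr)]


# 8 instructions over 4 subsets: each subset supplies exactly 2 registers.
free_lists = [deque(range(s, 64, 4)) for s in range(4)]   # hypothetical numbering
targets = round_robin_targets(8, 4)
dests = pick_exact(targets, free_lists)
print(Counter(targets))
print(dests)
```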
15
Performance issues
Register Write Specialization only:
  Round-robin allocation: no extra stage for register renaming
  Shorter register access time
Overall shorter pipeline: slightly better performance
16
Register Read Specialization
[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1]
17
Register Read Specialization
Limits the number of read ports on each individual register
Puts strong constraints on the allocation of instructions to clusters
Caution (personal opinion): don't use it alone!
Interconnection topology must ensure that every instruction is executable
18
WSRS architectures
Combining Register Read Specialization and Register Write Specialization
19
4-cluster WSRS architecture
[Figure: register subsets S0, S1, S2, S3 and clusters C0, C1, C2, C3]
The positions of an instruction's operands determine the execution cluster
20
4-cluster WSRS architecture: allocating instructions to clusters
[Figure: register subsets S0, S1, S2, S3 and clusters C0, C1, C2, C3]
Example: Op R6, R7 -> R5 (S1, S2 -> S0)
First operand determines top or bottom bicluster
Second operand determines left or right bicluster
21
4-cluster WSRS architecture: allocating instructions to clusters (2)
Example: Op R6, R7 -> R5 (S1, S2 -> S0)
The computations of the two bits are independent :-)
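A minimal sketch of how two independent bits could select the cluster. The wiring between subsets and clusters is shown only in the figure, which did not survive in this transcript, so the mapping below is an assumption: the first operand's subset gives the top/bottom bit (S0, S1 top; S2, S3 bottom), the second operand's subset gives the left/right bit (S0, S2 left; S1, S3 right), and cluster Ci writes subset Si. This choice reproduces the slide's example (S1, S2 -> S0), but other wirings would too.

```python
def top_bottom_bit(first_operand_subset):
    """Assumed wiring: S0, S1 -> top (0); S2, S3 -> bottom (1)."""
    return first_operand_subset >> 1


def left_right_bit(second_operand_subset):
    """Assumed wiring: S0, S2 -> left (0); S1, S3 -> right (1)."""
    return second_operand_subset & 1


def execution_cluster(first_operand_subset, second_operand_subset):
    """Each bit depends on one operand only, so both can be computed in parallel."""
    return 2 * top_bottom_bit(first_operand_subset) + left_right_bit(second_operand_subset)


# Slide example: operands in S1 and S2; the result lands in S0, i.e. cluster C0
# if cluster Ci writes subset Si (assumption).
print(execution_cluster(1, 2))
```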
22
4-cluster 8-way WSRS architecture: the register file
Each individual physical register: 2 identical copies of (4-read, 3-write) registers
  8x smaller than conventional monolithic approach
  12.8x smaller than conventional distributed approach
WSRS: 512 registers, 6.25 W, 3 cycles
Conventional: 256 registers, (16 W, 5 cycles) or (14.5 W, 4 cycles)
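The two "N x smaller" claims can be cross-checked against the per-register-bit areas quoted earlier in the deck for the conventional 8-way files (1120 w^2 monolithic, 1792 w^2 distributed). The sketch below assumes those two figures are per-bit areas and that total area is (number of registers) x (per-bit area); neither is stated explicitly on this slide.

```python
from fractions import Fraction

# Per-register-bit areas (in w^2) quoted for the conventional 8-way register files.
MONOLITHIC_AREA = Fraction(1120)
DISTRIBUTED_AREA = Fraction(1792)


def implied_wsrs_area():
    """Per-bit WSRS area implied by each of the two 'N x smaller' claims."""
    from_monolithic = MONOLITHIC_AREA / 8
    from_distributed = DISTRIBUTED_AREA / Fraction("12.8")
    return from_monolithic, from_distributed


def total_area_ratio(wsrs_entries=512, conventional_entries=256):
    """WSRS total area relative to the monolithic and distributed files."""
    wsrs_total = wsrs_entries * implied_wsrs_area()[0]
    return (wsrs_total / (conventional_entries * MONOLITHIC_AREA),
            wsrs_total / (conventional_entries * DISTRIBUTED_AREA))


print(implied_wsrs_area())
print(total_area_ratio())
```

Both claims imply the same per-bit area, so the two ratios on the slide are mutually consistent under this reading.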
23
4-cluster 8-way WSRS architecture: the wake-up logic
The wake-up logic monitors all possible sources for each operand
FUs from only two clusters are possible sources
8-way WSRS wake-up logic entry complexity = 4-way issue wake-up logic entry complexity
24
4-cluster 8-way WSRS architecture: bypass network
Possible sources for each operand: FUs from only two clusters are possible sources
Bypass points: (pipeline length) x (possible FU sources) + register file
  8-way distributed: 4 cycles, 49 possible operands
  WSRS (same as 4-way): 3 cycles, 19 possible operands
  8-way monolithic: 5 cycles, 61 possible operands
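The bypass-point formula can be replayed on the three configurations. The number of possible FU sources is not given on the slide; the values 12 and 6 used below are inferred by solving the formula backwards from the quoted totals, so treat them as assumptions.

```python
def bypass_points(pipeline_length, fu_sources):
    """(pipeline length) x (possible FU sources) + 1 for the register file."""
    return pipeline_length * fu_sources + 1


# (pipeline length in cycles, inferred number of FU sources per operand)
CONFIGS = {
    "8-way distributed": (4, 12),
    "WSRS": (3, 6),
    "8-way monolithic": (5, 12),
}

points = {name: bypass_points(*cfg) for name, cfg in CONFIGS.items()}
for name, n in points.items():
    print(f"{name}: {n}")
```

With these inferred source counts, WSRS halves the FU sources per operand (two clusters out of four) and also benefits from the shorter register access pipeline.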
25
4-cluster WSRS architecture: nothing is entirely free!
Strong constraint on the allocation of instructions to clusters: the cluster executing a dyadic instruction depends on the position of its operands in the register subsets.
Degrees of freedom:
  Monadic instructions can be executed on two clusters
  Three out of four commutative dyadic instructions can be executed on two distinct clusters
  Design clusters able to execute instructions in two forms? A - B and -B + A
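The "three out of four" figure can be reproduced by enumeration. The sketch assumes one particular wiring (first operand's subset selects top/bottom, second operand's subset selects left/right, as a 2-bit cluster index) and assumes operands are spread uniformly over the four subsets; both are illustrative assumptions rather than details taken from the slides.

```python
from itertools import product


def cluster(first_subset, second_subset):
    """Assumed wiring: high bit from the first operand, low bit from the second."""
    return 2 * (first_subset >> 1) + (second_subset & 1)


def commutative_choices(sub_a, sub_b):
    """Clusters able to run a commutative op, trying both operand orders."""
    return {cluster(sub_a, sub_b), cluster(sub_b, sub_a)}


def count_flexible_pairs():
    """Count operand-subset pairs for which swapping operands changes the cluster."""
    pairs = list(product(range(4), repeat=2))
    flexible = sum(1 for a, b in pairs if len(commutative_choices(a, b)) == 2)
    return flexible, len(pairs)


print(count_flexible_pairs())
```

Under this wiring the two operand orders map to the same cluster only when both operands sit in the same subset, which is 4 of the 16 subset pairs.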
26
Using monadic instructions for load balancing
[Figure: register subsets S0, S1, S2, S3 and clusters C0, C1, C2, C3; labels: S2, "S0 or S1"]
27
4-cluster WSRS architecture: nothing comes for free (2)
Extra free lists and associated logic
Extra pipeline stage(s): instructions must be allocated to clusters before the last step in register renaming: +3 cycles
But shorter register access time: -1 or 2 cycles
28
Performance issues on 4-cluster 8-way architectures
Workload may be unbalanced among the clusters:
  Use of the degrees of freedom:
    monadic instructions
    "commutative" clusters
Higher probability of local consumption of a register
Naive allocation policies on WSRS compete with naive policies on conventional architecture
29
Orthogonal to most previous works
Just apply previous proposals at cluster level
30
Summary
Register Write Specialization:
  limits power consumption, silicon area and access time
  does not impair performance
But: some extra complexity in register renaming
31
Summary (2)
Register Write Specialization + Register Read Specialization:
  further limits power consumption, silicon area and access time of the register file
  limits wake-up logic and bypass network complexity
But: constrains instruction allocation to clusters
32
Future work
Intelligent instruction allocation policies
Exploration of other possible interconnections
SMT mode