
1 Multicore Processors (1)
Dezső Sima, Fall 2007 (Ver. 2.2) © Dezső Sima, 2007

2 Foreword In a relatively short time interval of about two years, multicore processors have become the standard. As multiple cores provide substantially higher processing performance than their predecessors, their memory and I/O subsystems also need to be greatly enhanced to avoid processing bottlenecks. Consequently, when discussing multicore processors the focus of interest shifts from their microarchitecture to their macroarchitecture, which covers the cache hierarchy, the way memory and I/O controllers are attached, and the way the required on-chip interconnects are implemented. Accordingly, the first nine sections are devoted to the design space of the macroarchitecture of multicore processors. The individual sections deal with the main dimensions that span this design space and identify how the emerging concepts have been implemented in major multicore lines. Section 10 is a large repository providing relevant material and facts published in connection with major multicore lines. For a particular course, appropriate parts of the repository can be selected according to the scope and audience of the course. Each part of the repository concludes with a list of available literature to allow a more detailed study of a multicore line or processor of interest.

3 Overview
1 Introduction
2 Overview of MCPs
3 The design space of the macroarchitecture of MCPs
4 Layout of the cores
5 Layout of L2 caches
6 Layout of L3 caches
7 Layout of the memory and I/O architecture
8 Implementation of the on-chip interconnections
9 Basic alternatives of the macroarchitecture of MCPs
10 Multicore repository

4 1. Introduction

5 1. Introduction (1) Figure 1.1: Single-stream performance vs. cost. Source: Marr D.T. et al., "Hyper-Threading Technology Architecture and Microarchitecture", Intel Technology Journal, Vol. 06, Issue 01, Feb. 14, 2002, pp. 4-16.

6 1. Introduction (2) Figure 1.2: Historical evolution of transistor counts, clock speed, power and ILP

7 1. Introduction (3) Since ~2003/2004, for single-core processors an increasing number of transistors yields diminishing returns in performance. Moore's law: the transistor count doubles approximately every 24 months. Consequence: use the available surplus transistors for multiple cores.

8 1. Introduction (4) The era of single-core processors gives way to the era of multi-core processors.

9 1. Introduction (5) The speed of change: illustrated by Intel’s roadmap revealed in 2005:

10 1. Introduction (6) Figure 1.3: Intel’s roadmap revealed in 2005

11 1. Introduction (7) Expected increase of the core count: as long as Moore's law remains valid, the transistor count doubles approximately every 24 months; accordingly, the core count doubles approximately every 24 months as well.
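A minimal back-of-the-envelope sketch in C of this projection; the 2-core/2006 baseline is an assumption chosen for illustration, not a figure taken from the slides:

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    const int base_year = 2006;          /* assumed baseline year          */
    const double base_cores = 2.0;       /* assumed baseline core count    */
    const double doubling_months = 24.0; /* doubling period from the slide */

    for (int year = base_year; year <= 2016; year += 2) {
        double months = (year - base_year) * 12.0;
        double cores  = base_cores * pow(2.0, months / doubling_months);
        printf("%d: ~%.0f cores per chip\n", year, cores);
    }
    return 0;
}
```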

12 2. Overview of MCPs

13 2. Overview of MCPs (1) Figure 2.1: Intel's dual-core mobile processor lines. Timeline (2005-2006) showing the Core Duo T2000 (Yonah Duo), 1/06, 151 mtrs./31 W, 65 nm/90 mm2, and the Core 2 Duo T5000/T7000 (Merom), 7/06, 281 mtrs./34 W, 65 nm/143 mm2.

14 2. Overview of MCPs (2) Figure 2.2: Intel's multi-core processor lines. Timeline (2005-2006) showing the Pentium D 800 (Smithfield), 5/05, 230 mtrs./95 W, 90 nm/2*103 mm2; Pentium EE 840, 230 mtrs./130 W; Pentium D 900 (Presler), 1/06, 2*188 mtrs./130 W, 65 nm/2*81 mm2; Pentium EE 955/965, 2-way MT/core; Core 2 Duo E6000/X6800 (Conroe), 7/06, 291 mtrs./65/75 W, 65 nm/143 mm2; and Core 2 Extreme QX6700 (Kentsfield), 11/06, 2*291 mtrs./130 W, 65 nm/2*143 mm2.

15 2. Overview of MCPs (3) Figure 2.3: Intel's dual-core Xeon UP-line. Timeline (2006-2007) showing the Xeon 3000 (Conroe), 9/06, 291 mtrs./65 W, 65 nm/143 mm2.

16 2. Overview of MCPs (4) Figure 2.4: Intel's dual-core Xeon DP-lines. Timeline (2005-2006) showing the Xeon DP 2.8 (Paxville DP), 10/05, 2*169 mtrs./135 W, 90 nm/2*135 mm2; Xeon 5000 (Dempsey), 2-way MT/core, 2*188 mtrs./95/130 W, 65 nm/2*81 mm2; Xeon 5100 (Woodcrest), 6/06, 291 mtrs./65/80 W, 65 nm/143 mm2; and Xeon 5300 (Clovertown), 11/06, 2*291 mtrs./80/120 W, 65 nm/2*143 mm2.

17 2. Overview of MCPs (5) Figure 2.5: Intel's dual-core Xeon MP-lines. Timeline (2005-2006) showing the Xeon 7000 (Paxville MP), 11/05, 2*169 mtrs./95/150 W, 90 nm/2*135 mm2, and the Xeon 7100 (Tulsa), 8/06, 2-way MT/core, 1328 mtrs./95/150 W, 65 nm/435 mm2.

18 2. Overview of MCPs (6) Figure 2.6: Intel's dual-core EPIC-based server line. Timeline (2005-2006) showing the 9x00 (Montecito), 7/06, 2-way MT/core, 1720 mtrs./104 W, 90 nm/596 mm2.

19 2. Overview of MCPs (7) Figure 2.7: AMD's multi-core desktop processor lines. Timeline (2005-2007) showing the Athlon 64 X2 3800/4200/4600 (Manchester, stepping E2), 5/05, 154 mtrs./89/110 W, 90 nm/2*147 mm2; (Toledo, stepping E6), 233 mtrs./89/110 W, 90 nm/199 mm2; (Windsor, stepping F2/F3), 5/06, 154/227 mtrs./65/89 W, 90 nm/183/230 mm2; (Brisbane, stepping G1), 12/06, 154 mtrs./65 W, 65 nm/126 mm2; and the Athlon64 QC (Barcelona), 9/2007, 65 nm/291 mm2.

20 2. Overview of MCPs (8) Figure 2.8: AMD's dual-core high-end desktop/entry-level server Athlon 64 FX line. Timeline (2005-2006) showing the Athlon 64 FX (Toledo/Windsor), 1/06, 233/227 mtrs./110/125 W, 90 nm/199/230 mm2.

21 2. Overview of MCPs (9) Figure 2.9: AMD's dual-core Opteron UP-lines. Timeline (2005-2006) showing the Opteron 100 (Denmark), 8/05, 233 mtrs./110 W, 90 nm/199 mm2, and the Opteron 1000 (Santa Ana), 8/06, 227 mtrs./103/125 W, 90 nm/230 mm2.

22 2. Overview of MCPs (10) Figure 2.10: AMD's dual-core Opteron DP-lines. Timeline (2005-2006) showing the Opteron 200 (Italy), 3/05, 233 mtrs./95 W, 90 nm/199 mm2, and the Opteron 2000 (Santa Rosa), 8/06, 227 mtrs./95/110 W, 90 nm/230 mm2.

23 2. Overview of MCPs (11) Figure 2.11: AMD's dual-core Opteron MP-lines. Timeline (2005-2006) showing the Opteron 800 (Egypt), 4/05, 233 mtrs./95 W, 90 nm/199 mm2, and the Opteron 8000 (Santa Rosa), 8/06, 243 mtrs./95/119 W, 90 nm/220 mm2.

24 2. Overview of MCPs (12) Figure 2.12: IBM's multi-core server lines. Timeline (2001-2007) showing POWER4, 10/01, 180 nm/412 mm2, 174 mtrs./115/125 W; POWER4+, 11/02, 130 nm/380 mm2, 184 mtrs./70 W; POWER5, 5/04, 130 nm/389 mm2, 276 mtrs./80 W (est.), 2-way MT/core; POWER5+, 10/05, 90 nm/230 mm2, 276 mtrs./70 W, 2-way MT/core; POWER6, 5/2007, 65 nm/341 mm2, 750 mtrs./~100 W, 2-way MT/core; and Cell BE, 2006, 90 nm/221 mm2, 234 mtrs./95 W (PPE: 2-way MT, SPEs: no MT).

25 2. Overview of MCPs (13) Figure 2.13: Sun's and Fujitsu's multi-core server lines. Timeline (2004-2008) showing Gemini, 4/04, 80 mtrs./32 W, 130 nm/206 mm2; UltraSPARC IV (Jaguar), 7/04, 66 mtrs./108 W, 130 nm/352 mm2; UltraSPARC IV+ (Panther), 9/05, 295 mtrs./90 W, 90 nm/335 mm2; UltraSPARC T1 (Niagara), 11/2005, 279 mtrs./63 W, 90 nm/379 mm2, 4-way MT/core; UltraSPARC T2 (Niagara II), 72 W (est.), 65 nm/342 mm2, 8-way MT/core; APL SPARC64 VI (Olympus), 540 mtrs./120 W, 90 nm/421 mm2, 2-way MT/core; and APL SPARC64 VII (Jupiter), ~120 W, 65 nm/464 mm2.

26 2. Overview of MCPs (14) Figure 2.14: HP's PA-8x00 server line. Timeline (2004-2005) showing the PA-8800 (Mako), 2/2004, 300 mtrs./55 W, and the PA-8900 (Shortfin), 5/05, 317 mtrs., 130 nm/366 mm2.

27 2. Overview of MCPs (15) Figure 2.15: RMI's multi-core XLR-line (scalar RISC). Timeline (2005-2006) showing the XLR 5xx (8CMT), 5/05, 333 mtrs./10-50 W, 90 nm/~220 mm2, 4-way MT/core.

28 3. Design space of the macro architecture of MCPs

29 3. Design space of the macro architecture of MCPs (1)
Building blocks of multicore processors (MC): Cores (C); L2 cache(s) (L2); L3 cache(s), if available (L3); Interconnection network (IN); FSB controller (FSB c.) or Bus controller (B. c.); Memory controller (M. c.)

30 3. Design space of the macro architecture of MCPs (2)
Macro architecture of multi-core (MC) processors: Layout of the cores; Layout of L2 caches; Layout of L3 caches (if available); Layout of the I/O and memory architecture; Layout of the interconnections

31 4. Layout of the cores

32 4. Layout of the cores (1) Macro architecture of multi-core (MC) processors: Layout of the cores; Layout of L2 caches; Layout of L3 caches (if available); Layout of the I/O and memory architecture; Layout of the on-chip interconnections

33 4. Layout of the cores (2) Layout of the cores: Basic layout; Advanced features; Physical implementation

34 4. Layout of the cores (3) Basic layout: SMC (Symmetrical MC), all other MCs; HMC (Heterogeneous MC), Cell BE (1 PPE + 8 SPEs)

35 4. Layout of the cores (4) Advanced features of the cores
Virus protection; Power/clock speed management; Support of virtualisation; Data protection; Remote management. Details on the next slide!

36 4. Layout of the cores (5) Table 4.1: Comparison of advanced features provided by Intel's and AMD's MC lines (columns: virus protection / power and clock speed management / support of virtualisation / data protection / remote management):
Intel Pentium M-based(1): ED-bit; EIST
Intel NetBurst, Prescott-based(2): ED-bit; EIST (partly)
Intel NetBurst, Cedar Mill-based(3): ED-bit; EIST (partly); Vanderpool (partly)
Intel Core-based(4): ED-bit; EIST (partly); Vanderpool (partly); La Grande; AMT2
Intel Itanium 2: DBS; Silvervale
AMD Athlon 64 X2: NX-bit; Cool'n'Quiet; Pacifica (partly)
AMD Athlon 64 FX: NX-bit; Cool'n'Quiet; Pacifica
AMD Opteron x00 lines: NX-bit; PowerNow!; AMD-V
AMD Opteron x000 lines: NX-bit; PowerNow!; AMD-V; Presidio
(1) Pentium M-based line: Core Duo T2000
(2) Prescott-based desktop lines: Pentium D 8xx, EE 840; Xeon lines: Paxville DP, MP
(3) Cedar Mill-based desktop lines: Pentium D 9xx, EE 955/965; Xeon lines: 5000 (Dempsey), 7100 (Tulsa)
(4) Core-based mobile line: T5xxx/T7xxx (Merom); desktop line: Core 2 Duo E6xx, X6xxx, QX67xx; Xeon lines: 3000, 5100 (Woodcrest), 5300 (Clovertown)
For a brief introduction of the related Intel techniques see the appropriate Intel processor descriptions.

37 4. Layout of the cores (6) Brief description of the advanced features, using Intel's notation (1). ED: Execute Disable bit (often designated as the NX bit, No Execute bit). The ED bit, with OS support, is now a widely used technique to prevent certain classes of malicious buffer overflow attacks. It is used e.g. in Intel's Itanium, Pentium M, Pentium 4 Prescott and subsequent processors, as well as in AMD's x86-64 line (where it is designated as the NX bit). In buffer overflow attacks a malicious worm inserts its code into another program's data space and creates a flood of code that overwhelms the processor, allowing the worm to propagate itself to the network and to other computers. The Execute Disable bit allows the processor to classify areas in memory according to whether application code can be executed from them or not. The ED bit is implemented as bit 63 of the Page Table Entry; if its value is 1, code cannot be executed from that page. When a malicious worm attempts to insert code into the buffer, the processor disables code execution, preventing damage and worm propagation. To become active, the ED feature needs to be supported by the OS. EIST: Enhanced Intel SpeedStep Technology. First implemented in Intel's mobile and server platforms, it allows the system to dynamically adjust processor voltage and core frequency in order to decrease average power consumption and thus average heat production. This technology is similar to AMD's Cool'n'Quiet and PowerNow! techniques.
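As an illustration of the ED/NX mechanism described above, the following C sketch models a 64-bit x86-64 page-table entry with the no-execute flag in bit 63; the helper names are invented for this example and are not taken from the slides:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustration only: a 64-bit x86-64 page-table entry with the
 * Execute Disable (NX) flag held in bit 63, as described above.
 * The helper names are invented for this sketch. */
#define PTE_NX_BIT  63
#define PTE_NX_MASK (1ULL << PTE_NX_BIT)

static uint64_t pte_set_no_execute(uint64_t pte) { return pte | PTE_NX_MASK; }
static int      pte_is_executable(uint64_t pte)  { return (pte & PTE_NX_MASK) == 0; }

int main(void)
{
    uint64_t data_page = 0x00000000dead0003ULL; /* a present, writable page  */
    data_page = pte_set_no_execute(data_page);  /* OS marks data pages as NX */

    printf("code may execute from this page: %s\n",
           pte_is_executable(data_page) ? "yes" : "no");
    return 0;
}
```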

38 4. Layout of the cores (7) Brief description of the advanced features, using Intel's notation (2). VT: Virtualization Technology. Virtualisation techniques allow a platform to run multiple operating systems and applications in independent partitions. The hardware support improves the performance and robustness of traditional software-based virtualisation solutions. Intel introduced hardware virtualisation support first in its Itanium line (called the Silvervale technique), followed by its Presler-based and subsequent processors (designated as the Vanderpool technique). AMD uses this technique, dubbed Pacifica, in its recent lines. La Grande Technology: Intel's Trusted Execution Technology, intended to protect users from software-based attacks aimed at stealing vital data. It includes a set of hardware enhancements intended to provide multiple separated execution environments. For more information see the Intel Trusted Execution Technology Preliminary Architecture Specification. AMT2: Intel's Active Management Technology, a feature of its vPro technology. It allows the computer system to be managed remotely, even if the computer is switched off or the hard drive has failed. By means of this technology system administrators can e.g. remotely download software updates or fix and heal system problems. It can also be utilized to protect the network by proactively blocking incoming threats.

39 4. Layout of the cores (8) Physical implementation
Monolithic implementation vs. multi-chip implementation. The figure assigns Yonah Duo, Pentium D 8xx, Pentium EE 840, Paxville DP, Paxville MP, Pentium D 9xx, Pentium EE 955/965, Dempsey, Tulsa, Merom, Conroe, Kentsfield, Clovertown, Woodcrest, Montecito, and all DCs of AMD, IBM, Sun, Fujitsu, HP and RMI to these two classes.

40 5. Layout of L2 caches

41 5. Layout of L2 caches (1) Macro architecture of multi-core (MC) processors: Layout of the cores; Layout of L2 caches; Layout of L3 caches (if available); Layout of the I/O and memory architecture; Layout of the on-chip interconnections

42 5. Layout of L2 caches (2) Layout of L2 caches: Allocation to the cores; Use by instructions/data; Integration of L2 caches to the proc. chip; Inclusion policy; Banking policy

43 5. Layout of L2 caches (3) Allocation of L2 caches to the cores
Private L2 cache for each core, e.g. Athlon 64 X2 (2005), with per-core L2s attached via the system request queue and crossbar IN to the on-chip M. c., B. c. and HT bus; vs. shared L2 cache for all cores, e.g. Core Duo (2006) and Core 2 Duo (2006), with a shared L2 and its controller attached to the FSB controller and FSB. The figure shows the corresponding block diagrams.

44 5. Layout of L2 caches (4) Allocation of L2 caches to the cores
Private L2 cache for each core vs. shared L2 cache for all cores; the figure classifies Smithfield-, Presler-, Irwindale- and Cedar Mill-based lines (2005/2006), Core Duo (Yonah) and Core 2 Duo (Core) based lines (2006), Montecito (2006), Athlon 64 X2 and AMD's Opteron lines (2005/2006), POWER4 (2001), POWER5 (2005), POWER6 (2007), Cell BE (2006), UltraSPARC IV (2004), UltraSPARC IV+ (2005), UltraSPARC T1 (2005), UltraSPARC T2 (2007), SPARC64 VI (2007), SPARC64 VII (2008), PA 8800 (2004), PA 8900 (2005) and XLR (2005) into these two classes.

45 IBM: the effect of changing from shared L2 to private L2: improved latency, eliminating any impact of multicore competition for capacity. Source: IBM Journal of Research and Development, Nov. 2007.

46 5. Layout of L2 caches (5) Attaching L2 caches to MCPs
Allocation to the cores; Use by instructions/data; Integration of L2 caches to the proc. chip; Inclusion policy; Banking policy

47 5. Layout of L2 caches (6) Inclusion policy of L2 caches: exclusive L2 vs. inclusive L2. Exclusive L2 (victim cache): lines replaced (victimised) in L1 are written to L2; references to data missing in L1 but available in L2 initiate reloading the pertaining cache line into L1, and the reloaded line is deleted from L2. The L2 usually operates as a write-back cache: only modified data that is replaced in L2 is written back to memory, while unmodified data that is replaced in L2 is deleted. Memory traffic thus consists of reloading data missing in L1/L2 (high traffic) and writing back modified data (low traffic).
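To make the difference between the two policies concrete, here is a minimal, tag-only C sketch written for this transcript (not taken from the slides) with a single line per cache level; write-back traffic is ignored:

```c
#include <stdio.h>

/* Tag-only model of one L1 line and one L2 line; invented for this
 * transcript to contrast the exclusive (victim) and inclusive policies. */
typedef struct { unsigned long tag; int valid; } Line;

static Line l1, l2;

static void access_exclusive(unsigned long tag)
{
    if (l1.valid && l1.tag == tag) return;       /* L1 hit                     */
    Line victim = l1;                            /* line leaving L1            */
    if (l2.valid && l2.tag == tag) l2.valid = 0; /* L2 hit: deleted on reload  */
    /* on an L2 miss the line comes from memory and bypasses L2 */
    l1.tag = tag; l1.valid = 1;
    if (victim.valid) l2 = victim;               /* victim written to L2       */
}

static void access_inclusive(unsigned long tag)
{
    if (l1.valid && l1.tag == tag) return;                            /* L1 hit */
    if (!(l2.valid && l2.tag == tag)) { l2.tag = tag; l2.valid = 1; } /* fill L2 */
    l1.tag = tag; l1.valid = 1;                                       /* L2 keeps a copy */
}

int main(void)
{
    access_exclusive(0x10); access_exclusive(0x20); access_exclusive(0x10);
    printf("exclusive: L1=%lx, L2 %s\n", l1.tag,
           l2.valid ? "holds the victim" : "is empty");

    l1.valid = l2.valid = 0;
    access_inclusive(0x10); access_inclusive(0x20);
    printf("inclusive: L1=%lx, L2=%lx (copy kept)\n", l1.tag, l2.tag);
    return 0;
}
```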

48 5. Layout of L2 caches (7) Figure 5.1: Implementation of exclusive L2 caches. Source: Zheng Y., Davis B.T., Jordan M., "Performance evaluation of exclusive cache hierarchies", IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2004.

49 5. Layout of L2 caches (8) Inclusion policy of L2 caches: exclusive L2, used by the Athlon 64 X2 (2005) and the dual-core Opteron lines (2005); inclusive L2, used by most implementations. (The figure also marks the expected trend.)

50 5. Layout of L2 caches (9) Attaching L2 caches to MCPs
Allocation to the cores; Use by instructions/data; Integration of L2 caches to the proc. chip; Inclusion policy; Banking policy

51 5. Layout of L2 caches (10) Use by instructions/data
Use by instructions/data: unified instr./data cache(s), the typical implementation, vs. split instr./data cache(s), e.g. Montecito (2006). (The figure also marks the expected trend.)

52 5. Layout of L2 caches (11) Attaching L2 caches to MCPs
Allocation to the cores; Use by instructions/data; Integration of L2 caches to the proc. chip; Inclusion policy; Banking policy

53 5. Layout of L2 caches (12) Banking policy: single-banked implementation vs. multi-banked implementation (the figure marks the typical implementation).

54 5. Layout of L2 caches (13) Mapping of addresses to memory modules. POWER4 (2001) / POWER5 (2005): the 128-byte long L2 cache lines are hashed across the three L2 modules; hashing is performed by modulo-3 arithmetic applied to a large number of real address bits. UltraSPARC T1 (2005) (Niagara, 8 cores/4 L2 banks): the four L2 modules are interleaved at 64-byte blocks. Figure 5.2: Alternatives of mapping addresses to memory modules.
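A small C sketch of the two mapping schemes just described; the exact real-address bits used by POWER4/5 are not given in the slide, so applying the modulo-3 hash to the whole line address below is an assumption made for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* UltraSPARC T1 style: four L2 banks interleaved at 64-byte blocks. */
static unsigned t1_bank(uint64_t addr)
{
    return (addr >> 6) & 0x3;            /* bits 7:6 select one of 4 banks */
}

/* POWER4/5 style: 128-byte lines hashed across 3 banks with modulo 3,
 * applied here to the whole line address (assumption, see above). */
static unsigned power_bank(uint64_t addr)
{
    return (unsigned)((addr >> 7) % 3);  /* line number mod 3 */
}

int main(void)
{
    for (uint64_t addr = 0; addr < 1024; addr += 192)
        printf("addr %4llu -> T1 bank %u, POWER bank %u\n",
               (unsigned long long)addr, t1_bank(addr), power_bank(addr));
    return 0;
}
```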

55 5. Layout of L2 caches (14) Attaching L2 caches to MCPs
Allocation to the cores; Use by instructions/data; Integration of L2 caches to the proc. chip; Inclusion policy; Banking policy

56 5. Layout of L2 caches (15) Integration to the processor chip: partial integration of L2 (on-chip L2 tags/control, off-chip data), as in UltraSPARC IV (2004), UltraSPARC V (2005), PA 8800 (2004) and PA 8900 (2005); vs. full integration of L2 (entire L2 on-chip), as in all other lines considered, which is also the expected trend.

57 6. Layout of L3 caches

58 6. Layout of L3 caches (1) Macro architecture of multi-core (MC) processors: Layout of the cores; Layout of L2 caches; Layout of L3 caches (if available); Layout of the I/O and memory architecture; Layout of the interconnections

59 6. Layout of L3 caches (2) Layout of L3 caches
Allocation to the L2 cache(s); Use by instructions/data; Integration of L3 caches to the proc. chip; Inclusion policy; Banking policy

60 6. Layout of L3 caches (3) Allocation of L3 caches to the L2 caches
Shared L3 cache for all L2s: Athlon 64 X2–Barcelona (2007), POWER4 (2001), POWER5 (2005), UltraSPARC IV+ (2005); private L3 cache for each L2: Montecito (2006).

61 6. Layout of L3 caches (4) Layout of L3 caches
Allocation to the L2 cache(s); Use by instructions/data; Integration of L3 caches to the proc. chip; Inclusion policy; Banking policy

62 6. Layout of L3 caches (5) Inclusion policy of L3 caches: inclusive L3 vs. exclusive L3. Exclusive L3 (victim cache): lines replaced (victimised) in L2 are written to L3; references to data missing in L2 but available in L3 initiate reloading the pertaining cache line into L2, and the reloaded line is deleted from L3. The L3 usually operates as a write-back cache: only modified data that is replaced in L3 is written back to memory, while unmodified data that is replaced in L3 is deleted. Memory traffic thus consists of reloading data missing in L2/L3 (high traffic) and writing back modified data (low traffic).

63 6. Layout of L3 caches (6) Inclusion policy of L3 caches: exclusive L3 vs. inclusive L3. Examples classified in the figure: Montecito (2006), Athlon 64 X2–Barcelona (2007), POWER4 (2001), POWER5 (2005), POWER6 (2007), UltraSPARC IV+ (2005); the expected trend is also marked.

64 6. Layout of L3 caches (7) Layout of L3 caches
Allocation to the L2 cache(s); Use by instructions/data; Integration of L3 caches to the proc. chip; Inclusion policy; Banking policy

65 6. Layout of L3 caches (8) Use by instructions/data
Use by instructions/data: unified instr./data cache(s) vs. split instr./data caches. All multicore processors unveiled until now have unified L3 caches that hold both instructions and data.

66 6. Layout of L3 caches (9) Layout of L3 caches
Allocation to the L2 cache(s); Use by instructions/data; Integration of L3 caches to the proc. chip; Inclusion policy; Banking policy

67 6. Layout of L3 caches (10) Banking policy: single-banked vs. multi-banked implementation. All multicore processors unveiled until now have multi-banked L3 caches.

68 6. Layout of L3 caches (11) Layout of L3 caches
Allocation to the L2 cache(s); Use by instructions/data; Integration of L3 caches to the proc. chip; Inclusion policy; Banking policy

69 6. Layout of L3 caches (12) Integration to the processor chip: L3 partially integrated (on-chip L3 tags/control, off-chip data), as in UltraSPARC IV+ (2005), POWER4 (2001), POWER5 (2005) and POWER6 (2007); vs. L3 fully integrated (entire L3 on-chip), as in Montecito (2006), which is also the expected trend.

70 6. Layout of L3 caches (13) Examples of partially integrated L3 caches: POWER5 (2005) and UltraSPARC IV+ (2005). The block diagrams show on-chip L3 tags/control with off-chip L3 data, attached via the Fabric Bus controller and M. c. (POWER5) and via the interconnection network, M. c., B. c. and Fire Plane bus (UltraSPARC IV+).

71 6. Layout of L3 caches (14) Figure 6.1: Cache architecture of MC processors. The figure classifies along the dimensions: L2 partially integrated, no L3; L2 fully integrated, no L3; L2 fully integrated, L3 partially integrated; L2 fully integrated, L3 fully integrated; each further split into L2 private vs. L2 shared and, where applicable, L3 private vs. L3 shared. Examples placed in the figure: PA-8800, PA-8900, UltraSPARC IV, Gemini, Smithfield/Presler based proc.s, Athlon 64 X2 (excl.), Opteron dual core (excl.), Core Duo (Yonah), Core 2 Duo based proc.s, Cell BE, UltraSPARC T1/T2, SPARC64 VI/VII, XLR, POWER4, POWER5 (excl.), UltraSPARC IV+ (excl.), POWER6, Montecito. (The figure does not differentiate between the inclusive/exclusive alternatives but denotes them in the examples.)

72 6. Layout of L3 caches (15) Examples for cache architectures (1)
L2 partially integrated, private, no L3: UltraSPARC IV (2004); L2 partially integrated, shared, no L3: PA-8800 (2004), PA-8900 (2005). The block diagrams show on-chip L2 tags/control with off-chip L2 data, attached via the interconnection network, M. c., B. c. and Fire Plane bus (UltraSPARC IV) and via the FSB controller and FSB (PA-8x00).

73 6. Layout of L3 caches (16) Examples for cache architectures (2)
L2 fully integrated, private, no L3: Athlon 64 X2 (2005); L2 fully integrated, shared, no L3: Core Duo (2006), Core 2 Duo (2006), UltraSPARC T1 (2005). The block diagrams show the Athlon 64 X2 (per-core L2s, system request queue, crossbar IN, on-chip M. c., B. c. and HT bus), the Core (2) Duo (shared L2 with L2 controller and FSB controller) and the UltraSPARC T1 (eight cores connected by a crossbar to the banked shared L2, M. c. and JBus).

74 6. Layout of L3 caches (17) Examples for cache architectures (3)
L2 fully integrated, shared; L3 partially integrated: POWER4 (2001), UltraSPARC IV+ (2005). The block diagrams show on-chip L3 tags/control with off-chip L3 data, attached via the Fabric Bus controller, chip-to-chip/module-to-module interconnects and GX bus (POWER4) and via the interconnection network, M. c., B. c. and Fire Plane bus (UltraSPARC IV+).

75 6. Layout of L3 caches (18) Examples for cache architectures (4): L2 fully integrated, private; L3 fully integrated: Montecito (2006). The block diagram shows two cores, each with split L2 I/D caches and its own L3, attached to the FSB controller and FSB.

76 7. Layout of memory and I/O architecture

77 7. Layout of memory and I/O architecture (1)
Macro architecture of multi-core (MC) processors: Layout of the cores; Layout of L2 caches; Layout of L3 caches (if available); Layout of the I/O and memory architecture; Layout of the interconnections

78 7. Layout of memory and I/O architecture (2)
Layout of the I/O and memory connection: The concept of connection; Point of attachment; Physical implementations of the M. c.

79 7. Layout of memory and I/O architecture (3)
The concept of connection: connection via the FSB, typically used with off-chip M. c.-s, vs. connection via the M. c. and B. c., used with on-chip M. c.-s. Examples shown in the figure: Core 2 Duo and Montecito (attached via the FSB controller and FSB) and UltraSPARC T1 (crossbar to the on-chip M. c. and B. c./JBus).

80 7. Layout of memory and I/O architecture (4)
The point of attachment: (a) the highest cache level, via an IN, i.e. L2 (private/shared) for two-level cache hierarchies or L3 (private/shared) for three-level hierarchies; (b) the IN connecting the two highest cache levels, i.e. the IN connecting the cores and a shared L2 (two-level hierarchies) or the IN connecting a shared L2 and a shared L3 (three-level hierarchies). The figure shows schematic diagrams of these alternatives.

81 7. Layout of memory and I/O architecture (5)
Cache inclusivity vs. memory attachment: with an inclusive highest-level cache the memory controller is typically attached to that cache, as in Montecito (2006); with an exclusive (victim) highest-level cache it is attached to the IN between the two highest cache levels, as in POWER4 (2001), POWER5 (2004) and UltraSPARC IV+ (2005).

82 7. Layout of memory and I/O architecture (6)
The point of attachment (continued): as in the classification on the previous slide, the M. c. is usually attached to the highest cache level if that cache is inclusive, and to the IN connecting the two highest cache levels if the highest-level cache is exclusive.

83 7. Layout of memory and I/O architecture (7)
Examples of attaching memory and I/O via the highest cache level: in case of a two-level cache hierarchy, Core 2 Duo (2006, shared L2 attached to the FSB controller and FSB) and Athlon 64 X2 (2005, per-core L2s attached via the system request queue/crossbar to the on-chip M. c., B. c. and HT bus); in case of a three-level cache hierarchy, Montecito (2006, per-core L2 I/D and L3 caches attached to the FSB controller and FSB).

84 7. Layout of memory and I/O architecture (8)
Examples of attaching memory and I/O via the interconnection network connecting the two highest levels of the cache hierarchy: in case of a two-level cache hierarchy, UltraSPARC T1 (2005, crossbar between the cores and the banked L2, with the M. c. and B. c./JBus attached to the crossbar); in case of a three-level cache hierarchy, UltraSPARC IV+ (2005, exclusive L3, with the M. c. and B. c./Fire Plane bus attached to the interconnection network between L2 and L3).

85 7. Layout of memory and I/O architecture (9)
Integration of the memory controller onto the processor chip. Off-chip memory controller: longer access times, but independence from memory technology; used by POWER4 (2001), PA-8800 (2004), PA-8900 (2005), Smithfield (2005), Presler (2005), Yonah Duo (2006), Core (2006), Montecito (2006?). On-chip memory controller: shortens access times, but ties the processor to a particular memory technology and speed; used by POWER5 (2005), UltraSPARC IV (2004), UltraSPARC IV+ (2005), UltraSPARC T1 (2005), Athlon 64 X2 (2005). The on-chip memory controller is the expected trend.

86 8. Layout of the on-chip interconnections

87 8. Layout of interconnections (1)
Macro architecture of multi-core (MC) processors: Layout of the cores; Layout of L2 caches; Layout of L3 caches (if available); Layout of the I/O and memory architecture; Layout of the interconnections

88 8. Layout of interconnections (2)
On-chip interconnections: between cores and shared L2 modules; between private or shared L2 modules and shared L3 modules; between private or shared L2/L3 modules and the B. c./M. c. or, alternatively, the FSB c.; plus chip-to-chip and module-to-module interconnects. The figure sketches these connection points (IN1, IN) within a chip and towards the I/O bus, memory and FSB.

89 8. Layout of interconnections (3)
Implementation of the interconnections: arbiter/multi-port cache implementations, suited to a small number of sources/destinations, e.g. to connect dual cores to shared L2 caches; crossbar and ring, suited to a larger number of sources/destinations. Quantitative aspects, such as the number of sources/destinations or the bandwidth requirement, determine which implementation alternative is the most beneficial. Examples: UltraSPARC T1 (2005) and UltraSPARC T2 (2007) use a crossbar, Cell BE (2006) a ring; XLR (2005) is also named in the figure.
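To give a feel for why the number of sources/destinations matters when choosing an implementation, here is a rough, purely illustrative C calculation; the cost measures below are simplifying assumptions made for this sketch, not figures from the lecture:

```c
#include <stdio.h>

/* Rough cost indicators for the interconnect alternatives named above
 * (illustrative assumptions, not lecture data):
 *   crossbar : crosspoint count grows as sources * destinations
 *   ring     : average hop count grows as ~N/4 for N stops (bidirectional) */
int main(void)
{
    for (int n = 2; n <= 16; n *= 2) {
        int crosspoints = n * n;            /* n sources x n destinations          */
        double avg_ring_hops = n / 4.0;     /* bidirectional ring, uniform traffic */
        printf("%2d endpoints: crossbar ~%3d crosspoints, ring ~%.1f avg hops\n",
               n, crosspoints, avg_ring_hops);
    }
    return 0;
}
```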

90 9. Basic alternatives of macro-architectures of MC processors

91 9. Basic alternatives of macro-architectures of MC processors (1)
Basic dimensions spanning the design space: the kind of cache architecture involved; the layout of the memory and I/O connection; the layout of the IN(s) involved. Interrelationships between particular dimensions: cache inclusivity affects the layout of the memory connection.

92 9. Basic alternatives of macro-architectures of MC processors (2)
Considering a selected part of the design space of the macroarchitecture of MC processors. Assumptions: a two-level cache hierarchy is considered (no L3); L2 may be only partially integrated on the chip (i.e. only the tags are on the chip), which is not indicated in the design-space tree but is denoted in the examples; we do not consider how the involved INs are implemented; in case of an off-chip M. controller the use of an FSB is assumed.

93 9. Basic alternatives of macro-architectures of MC processors (3)
Design space of the macroarchitecture of MC processors (in case of a two-level cache hierarchy): off-chip M. c. vs. on-chip M. c., each with L2 private or L2 shared; this is the part of the design space specifying the point of attachment for the memory and I/O.

94 9. Basic alternatives of macro-architectures of MC processors (4)
Design space of the macroarchitecture of MC processors (2-level hierarchy) (1): M. c. off-chip. L2 private: Smithfield/Presler based processors (2005/2006, 2 cores). L2 shared: PA-8800 (2004, 2 cores, L2 data off-chip), Core 2 Duo based processors (2006, 2 cores), SPARC64 VI (2007, 2 cores), SPARC64 VII (2008, 4 cores). The figure shows the corresponding block schemes (cores, L2, INs, FSB c. and FSB).

95 9. Basic alternatives of macro-architectures of MC processors (5)
Design space of the macroarchitecture of MC processors (2-level hierarchy) (2): M. c. on-chip. L2 private: Opteron dual core (2005, 2 cores), UltraSPARC IV (2004, 2 cores, L2 data off-chip), Athlon 64 X2 (2005, 2 cores). L2 shared, with sub-alternatives for the point of attachment (both I/O and M. c. to L2 via IN2; M. c. to L2 via IN2 and I/O to IN1; I/O to L2 via IN2 and M. c. to IN1; both I/O and M. c. via IN1): UltraSPARC T1 (2005, 8 cores), Cell BE (2006, 8 cores). The figure shows the corresponding block schemes.

