Multicore Processors (1). Dezső Sima, Fall 2007 (Ver. 2.2). © Dezső Sima, 2007.

Foreword
In a relatively short time interval of about two years, multicore processors have become the standard. As multiple cores provide substantially higher processing performance than their predecessors, their memory and I/O subsystems also need to be substantially enhanced to avoid processing bottlenecks. Consequently, when discussing multicore processors the focus of interest shifts from their microarchitecture to their macroarchitecture, which incorporates their cache hierarchy, the way memory and I/O controllers are attached, and how the required on-chip interconnects are implemented. Accordingly, the first nine sections are devoted to the design space of the macroarchitecture of multicore processors. Particular sections deal with the main dimensions that span the design space and identify how the concepts encountered have been implemented in major multicore lines. Section 10 is a large repository providing relevant materials and facts published in connection with major multicore lines. In a concrete course, appropriate parts of the repository can be selected according to the scope and audience of the course. Each part of the repository concludes with a list of available literature to allow a more detailed study of a multicore line or processor of interest.

Overview
1 Introduction
2 Overview of MCPs
3 The design space of the macroarchitecture of MCPs
4 Layout of the cores
5 Layout of L2 caches
6 Layout of L3 caches
7 Layout of the memory and I/O architecture
8 Implementation of the on-chip interconnections
9 Basic alternatives of the macro-architecture of MCPs
10 Multicore repository

1. Introduction

1. Introduction (1) Figure 1.1: Single-stream performance vs. cost. Source: Marr D. T. et al., "Hyper-Threading Technology Architecture and Microarchitecture", Intel Technology Journal, Vol. 06, Issue 01, Feb. 14, 2002.

1. Introduction (2) Figure 1.2: Historical evolution of transistor counts, clock speed, power and ILP

1. Introduction (3) For single-core processors, an increasing number of transistors yields diminishing returns in performance. Moore's law: the transistor count doubles approximately every 24 months. Since about 2003/2004, the available surplus transistors are used for multiple cores.

1. Introduction (4) The era of single-core processors has given way to the era of multi-core processors.

1. Introduction (5) The speed of this change is illustrated by Intel's roadmap revealed in 2005:

1. Introduction (6) Figure 1.3: Intel’s roadmap revealed in 2005

1. Introduction (7) Expected increase of the core count: as long as Moore's law remains valid, the transistor count doubles approximately every 24 months; accordingly, the core count can be expected to double approximately every 24 months as well.
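To make this projection concrete, a minimal sketch of the stated 24-month doubling rule is given below in C; the 2006 starting point of two cores is only an illustrative assumption, not a figure from the slides.

```c
#include <math.h>
#include <stdio.h>

/* Projects the core count under the slide's assumption that the count
   doubles approximately every 24 months.  Baseline values are illustrative. */
int main(void)
{
    const int    base_year  = 2006;
    const double base_cores = 2.0;   /* assumed starting core count           */
    const double period     = 2.0;   /* doubling period in years (24 months)  */

    for (int year = 2006; year <= 2014; year += 2) {
        double cores = base_cores * pow(2.0, (year - base_year) / period);
        printf("%d: ~%.0f cores\n", year, cores);
    }
    return 0;                        /* compile with -lm */
}
```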

2. Overview of MCPs

2. Overview of MCPs (1)
Figure 2.1: Intel's dual-core mobile processor lines:
- Core Duo T2000 (Yonah Duo), 1/06: 151 mtrs./31 W, 65 nm/90 mm²
- Core 2 Duo T5000/T7000 (Merom), 7/06: 34 W, 65 nm/143 mm²

2. Overview of MCPs (2)
Figure 2.2: Intel's multi-core processor lines:
- Pentium D 800 (Smithfield), 5/05: 230 mtrs./95 W, 90 nm/2*103 mm²
- Pentium EE 840 (Smithfield), 5/05: 230 mtrs./130 W, 90 nm/2*103 mm², 2-way MT/core
- Pentium D 900 (Presler), 1/06: 2*188 mtrs./130 W, 65 nm/2*81 mm²
- Pentium EE 955/965 (Presler), 1/06: 2*188 mtrs./130 W, 65 nm/2*81 mm², 2-way MT/core
- Core 2 Duo E6000/X6800 (Conroe), 7/06: 291 mtrs./65/75 W, 65 nm/143 mm²
- Core 2 Extreme QX6700 (Kentsfield), 11/06: 2*291 mtrs./130 W, 65 nm/2*143 mm²

2. Overview of MCPs (3)
Figure 2.3: Intel's dual-core Xeon UP line:
- Xeon 3000 (Conroe based): 65 W, 65 nm/143 mm²

2. Overview of MCPs (4)
Figure 2.4: Intel's dual-core Xeon DP lines:
- Xeon DP (Paxville DP), 2005: 2*169 mtrs./135 W, 90 nm/2*135 mm², 2-way MT/core
- Xeon 5000 (Dempsey), 2006: 2*188 mtrs./95/130 W, 65 nm/2*81 mm², 2-way MT/core
- Xeon 5100 (Woodcrest), 2006: 65/80 W, 65 nm/143 mm²
- Xeon 5300 (Clovertown), 2006: 2*291 mtrs./80/120 W, 65 nm/2*143 mm²

2. Overview of MCPs (5)
Figure 2.5: Intel's dual-core Xeon MP lines:
- Xeon (Paxville MP), 2005: 2*169 mtrs./95/150 W, 90 nm/2*135 mm², 2-way MT/core
- Xeon 7100 (Tulsa), 2006: 1328 mtrs./95/150 W, 65 nm/435 mm², 2-way MT/core

2. Overview of MCPs (6)
Figure 2.6: Intel's dual-core EPIC-based server line:
- 9x00 (Montecito), 7/06: 1720 mtrs./104 W, 90 nm/596 mm², 2-way MT/core

2. Overview of MCPs (7)
Figure 2.7: AMD's multi-core desktop processor lines:
- Athlon 64 X2 3800/4200/4600 (Manchester, stepping E2), 5/05: 154 mtrs./89/110 W, 90 nm/147 mm²
- Athlon 64 X2 (Toledo, stepping E6), 5/05: 233 mtrs./89/110 W, 90 nm/199 mm²
- Athlon 64 X2 (Windsor, stepping F2/F3), 5/06: 154/227 mtrs./65/89 W, 90 nm/183/230 mm²
- Athlon 64 X2 (Brisbane, stepping G1), 12/06: 154 mtrs./65 W, 65 nm/126 mm²
- Athlon 64 QC (Barcelona), 9/2007: 65 nm/291 mm²

2. Overview of MCPs (8)
Figure 2.8: AMD's dual-core high-end desktop/entry-level server Athlon 64 FX line:
- Athlon 64 FX (Toledo/Windsor), 1/06: 233/227 mtrs./110/125 W, 90 nm/199/230 mm²

2. Overview of MCPs (9)
Figure 2.9: AMD's dual-core Opteron UP lines:
- Opteron 100 (Denmark), 8/05: 233 mtrs./110 W, 90 nm/199 mm²
- Opteron (Santa Ana): 103/125 W, 90 nm/230 mm²

2. Overview of MCPs (10)
Figure 2.10: AMD's dual-core Opteron DP lines:
- Opteron 200 (Italy): 95 W, 90 nm/199 mm²
- Opteron (Santa Rosa), 2006: 227 mtrs./95/110 W, 90 nm/230 mm²

2. Overview of MCPs (11)
Figure 2.11: AMD's dual-core Opteron MP lines:
- Opteron 800 (Egypt): 95 W, 90 nm/199 mm²
- Opteron (Santa Rosa), 2006: 243 mtrs./95/119 W, 90 nm/220 mm²

2. Overview of MCPs (12)
Figure 2.12: IBM's multi-core server lines:
- POWER4, 10/01: 174 mtrs./115/125 W, 180 nm/412 mm²
- POWER4+, 11/02: 130 nm/380 mm²
- POWER5: 130 nm/389 mm², 2-way MT/core
- POWER5+: 90 nm/230 mm²
- Cell BE: 90 nm/221 mm² (PPE: 2-way MT, SPEs: no MT)
- POWER6: 65 nm/341 mm², 2-way MT/core

Figure 2.13.: Sun’s and Fujitsu’s multi-core server lines 2. Overview of MCPs (13)

2. Overview of MCPs (14)
Figure 2.14: HP's PA-8x00 server line:
- PA-8800 (Mako): 55 W
- PA-8900 (Shortfin): 130 nm/366 mm²

2. Overview of MCPs (15)
Figure 2.15: RMI's multi-core XLR line (scalar RISC):
- XLR 5xx (8 cores, 4-way MT/core): 10-50 W, 90 nm/~220 mm²

3. Design space of the macro architecture of MCPs

3. Design space of the macro architecture of MCPs (1)
Building blocks of multicore processors (MC):
- cores (C)
- L2 cache(s) (L2)
- L3 cache(s), if available (L3)
- interconnection network (IN)
- memory controller (M. c.)
- bus controller (B. c.)
- FSB controller (FSB c.)
(Block diagram: cores with their L2/L3 caches are connected either via an IN to the M. c./B. c. or via an FSB c. to the FSB.)

3. Design space of the macro architecture of MCPs (2). Macro architecture of multi-core (MC) processors, its main dimensions: layout of the cores; layout of L2 caches; layout of L3 caches (if available); layout of the I/O and memory architecture; layout of the interconnections.

4. Layout of the cores

4. Layout of the cores (1). Macro architecture of multi-core (MC) processors, its main dimensions: layout of the cores; layout of L2 caches; layout of L3 caches (if available); layout of the I/O and memory architecture; layout of the on-chip interconnections.

4. Layout of the cores (2). Layout of the cores: basic layout; advanced features; physical implementation.

4. Layout of the cores (3)
Basic layout:
- SMC (Symmetrical MC): all other MCs
- HMC (Heterogeneous MC): Cell BE (1 PPE + 8 SPEs)

4. Layout of the cores (4)
Advanced features of the cores: virus protection, data protection, power/clock speed management, support of virtualisation, remote management. Details on the next slide!

4. Layout of the cores (5)
Table 4.1: Comparison of the advanced features provided by Intel's and AMD's MC lines. (For a brief introduction to the related Intel techniques see the appropriate Intel processor descriptions.)
Processor lines compared:
- Intel, Pentium M-based (1): Core Duo T2000
- Intel, NetBurst Prescott-based (2): desktop lines Pentium D 8xx, EE; Xeon lines Paxville DP, MP
- Intel, NetBurst Cedar Mill-based (3): desktop lines Pentium D 9xx, EE 955/965; Xeon lines 5000 (Dempsey), 7100 (Tulsa)
- Intel, Core-based (4): mobile line T5xxx/T7xxx (Merom); desktop line Core 2 Duo E6xxx, X6xxx, QX67xx; Xeon lines 3000, 5100 (Woodcrest), 5300 (Clovertown)
- Intel, Itanium 2
- AMD: Athlon 64 X2, Athlon 64 FX, Opteron x00/x000 lines
Feature categories and the techniques listed in the table: virus protection (ED bit, NX bit), data protection (La Grande, Presidio), power/clock speed management (EIST, DBS, Cool'n'Quiet, PowerNow!), support of virtualisation (Vanderpool (partly), Silvervale, Pacifica (partly), AMD-V), remote management (AMT2).

4. Layout of the cores (6)
Brief description of the advanced features, using Intel's notations (1)
EIST: Enhanced Intel SpeedStep Technology. First implemented in Intel's mobile and server platforms, it allows the system to dynamically adjust processor voltage and core frequency in order to decrease average power consumption and thus average heat production. This technology is similar to AMD's Cool'n'Quiet and PowerNow! techniques.
ED: Execute Disable bit (often designated as the NX bit, No Execute bit). The ED bit, with OS support, is now a widely used technique to prevent certain classes of malicious buffer-overflow attacks. It is used e.g. in Intel's Itanium, Pentium M, Pentium 4 Prescott and subsequent processors, as well as in AMD's x86-64 line (where it is designated as the NX bit). In buffer-overflow attacks a malicious worm inserts its code into another program's data space and creates a flood of code that overwhelms the processor, allowing the worm to propagate itself to the network and to other computers. The Execute Disable bit allows the processor to classify areas in memory by whether application code may be executed from them or not. The ED bit is implemented as bit 63 of the Page Table Entry; if its value is 1, code cannot be executed from that page. When a malicious worm attempts to insert code into the buffer, the processor disables code execution, preventing damage and worm propagation. To become active, the ED feature needs to be supported by the OS.
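As a small illustration of the mechanism just described, the sketch below sets and tests bit 63 of a 64-bit page-table entry; the PTE value and helper names are hypothetical, chosen for illustration rather than taken from any OS interface or from the slides.

```c
#include <stdint.h>
#include <stdio.h>

/* Bit 63 of an x86-64 page-table entry is the Execute Disable (ED/NX) bit:
   when set to 1, code may not be executed from the mapped page. */
#define PTE_NX_BIT ((uint64_t)1 << 63)

static uint64_t mark_no_execute(uint64_t pte) { return pte | PTE_NX_BIT; }
static int      is_executable(uint64_t pte)   { return (pte & PTE_NX_BIT) == 0; }

int main(void)
{
    uint64_t data_page_pte = 0x00000000deadb000ULL;   /* hypothetical PTE  */

    data_page_pte = mark_no_execute(data_page_pte);   /* OS marks the page */
    printf("page executable: %s\n", is_executable(data_page_pte) ? "yes" : "no");
    return 0;
}
```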

4. Layout of the cores (7)
Brief description of the advanced features, using Intel's notations (2)
VT: Virtualization Technology. Virtualisation techniques allow a platform to run multiple operating systems and applications in independent partitions. The hardware support improves the performance and robustness of traditional software-based virtualisation solutions. Intel introduced hardware virtualisation support first in its Itanium line (called the Silvervale technique), followed by its Presler-based and subsequent processors (designated as the Vanderpool technique). AMD has used this technique, dubbed Pacifica, in its recent lines since May.
La Grande Technology: a Trusted Execution Technology initiated by Intel to protect users from software-based attacks aimed at stealing vital data. It includes a set of hardware enhancements intended to provide multiple separated execution environments. For more information see the Intel Trusted Execution Technology Preliminary Architecture Specification.
AMT2: Intel's Active Management Technology is a feature of its vPro technology. It allows the computer system to be managed remotely, even if the computer is switched off or the hard drive has failed. By means of this technology system administrators can e.g. remotely download software updates or fix and heal system problems. It can also be utilized to protect the network by proactively blocking incoming threats.

4. Layout of the cores (8)
Physical implementation:
- Monolithic implementation (single die): Pentium D 8xx, Pentium EE 840, Yonah Duo, Merom, Conroe, Woodcrest, Tulsa, Montecito, and all DCs of AMD, IBM, Sun, Fujitsu, HP and RMI
- Multi-chip implementation (two dies per package): Pentium D 9xx, Pentium EE 955/965, Paxville DP, Paxville MP, Dempsey, Kentsfield, Clovertown

5. Layout of L2 caches

5. Layout of L2 caches (1). Macro architecture of multi-core (MC) processors, its main dimensions: layout of the cores; layout of L2 caches; layout of L3 caches (if available); layout of the I/O and memory architecture; layout of the on-chip interconnections.

5. Layout of L2 caches (2). Layout of L2 caches: allocation to the cores; inclusion policy; use by instructions/data; banking policy; integration of L2 caches to the proc. chip.

5. Layout of L2 caches (3)
Allocation of L2 caches to the cores: a private L2 cache for each core or a shared L2 cache for all cores. Examples:
- Private L2 for each core: Athlon 64 X2 (2005) (cores with private L2s attach via the System Request Queue and the crossbar IN (Xbar) to the on-chip M. c. and the B. c./HT bus)
- Shared L2 for all cores: Core Duo (2006), Core 2 Duo (2006) (both cores share the L2 and its controller, attached via the FSB c. to the FSB)

5. Layout of L2 caches (4)
Allocation of L2 caches to the cores:
- Private L2 cache for each core: Montecito (2006), UltraSPARC IV (2004), Athlon 64 X2 and AMD's Opteron lines (2005/2006), Smithfield, Presler, Irwindale and Cedar Mill based lines (2005/2006), Cell BE (2006), POWER6 (2007)
- Shared L2 cache for all cores: POWER4 (2001), POWER5 (2005), Core Duo (Yonah) and Core 2 Duo (Core) based lines (2006), UltraSPARC T1 (2005), UltraSPARC T2 (2007), UltraSPARC IV+ (2005), PA 8800 (2004), PA 8900 (2005), SPARC64 VI (2007), SPARC64 VII (2008), XLR (2005)

IBM: the effect of changing from a shared L2 to private L2 caches is improved latency, eliminating any impact of inter-core competition for capacity. Source: IBM JRD, Nov. 2007.

5. Layout of L2 caches (5). Attaching L2 caches to MCPs: allocation to the cores; inclusion policy; use by instructions/data; banking policy; integration of L2 caches to the proc. chip.

5. Layout of L2 caches (6)
Inclusion policy of L2 caches: exclusive L2 (victim cache) or inclusive L2.
Exclusive L2 (victim cache): lines replaced (victimized) in L1 are written to L2. References to data missing in L1 but available in L2 initiate reloading of the pertaining cache line into L1; the reloaded data is deleted from L2. Data missing in both L1 and L2 is fetched from memory directly into L1 (high traffic), while only modified lines replaced in L2 are written back to memory (low traffic).
In both alternatives L2 usually operates as a write-back cache (only modified data that is replaced in L2 is written back to memory); unmodified data that is replaced in L2 is deleted.
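The two policies can also be contrasted in code. The following toy model (a one-line L1 and a one-line L2 with invented names; purely a sketch, not any processor's actual controller logic) shows where a line ends up on an L1 miss under each policy:

```c
#include <stdio.h>
#include <string.h>

typedef struct { unsigned long tag; int valid; } line_t;
typedef enum { INCLUSIVE_L2, EXCLUSIVE_L2 } policy_t;

static line_t L1, L2;      /* one-entry "caches", for illustration only */
static int mem_fetches;

static void access_line(unsigned long tag, policy_t p)
{
    if (L1.valid && L1.tag == tag) return;          /* L1 hit                     */

    line_t victim = L1;                             /* line being replaced in L1  */

    if (L2.valid && L2.tag == tag) {                /* L1 miss, L2 hit            */
        L1.tag = tag; L1.valid = 1;                 /* reload the line into L1    */
        if (p == EXCLUSIVE_L2) L2.valid = 0;        /* ... and delete it from L2  */
    } else {                                        /* miss in both levels        */
        mem_fetches++;                              /* fetch the line from memory */
        L1.tag = tag; L1.valid = 1;
        if (p == INCLUSIVE_L2) { L2.tag = tag; L2.valid = 1; }  /* keep copy in L2 */
    }

    if (p == EXCLUSIVE_L2 && victim.valid)          /* victimized L1 line -> L2   */
        L2 = victim;
}

int main(void)
{
    const unsigned long trace[] = { 1, 2, 1, 3, 2 };
    const policy_t pols[] = { INCLUSIVE_L2, EXCLUSIVE_L2 };

    for (int p = 0; p < 2; p++) {
        memset(&L1, 0, sizeof L1);
        memset(&L2, 0, sizeof L2);
        mem_fetches = 0;
        for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
            access_line(trace[i], pols[p]);
        printf("%-9s L2: %d memory fetches for the same trace\n",
               pols[p] == INCLUSIVE_L2 ? "inclusive" : "exclusive", mem_fetches);
    }
    return 0;
}
```

Running the sketch shows the exclusive (victim) arrangement servicing more references from the cache pair, reflecting the extra effective capacity it provides.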

5. Layout of L2 caches (7)
Figure 5.1: Implementation of exclusive L2 caches. Source: Zheng, Y., Davis, B. T., Jordan, M., "Performance evaluation of exclusive cache hierarchies", 2004 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2004.

5. Layout of L2 caches (8)
Inclusion policy of L2 caches:
- Exclusive L2: Athlon 64 X2 (2005), dual-core Opteron lines (2005)
- Inclusive L2: most implementations
(Expected trend)

5. Layout of L2 caches (9). Attaching L2 caches to MCPs: allocation to the cores; inclusion policy; use by instructions/data; banking policy; integration of L2 caches to the proc. chip.

5. Layout of L2 caches (10)
Use by instructions/data:
- Unified instr./data cache(s): the typical implementation
- Split instr./data cache(s): Montecito (2006)
(Expected trend)

5. Layout of L2 caches (11). Attaching L2 caches to MCPs: allocation to the cores; inclusion policy; use by instructions/data; banking policy; integration of L2 caches to the proc. chip.

5. Layout of L2 caches (12)
Banking policy: single-banked implementation (the typical implementation) or multi-banked implementation.

5. Layout of L2 caches (13)
Figure 5.2: Alternatives of mapping addresses to the L2 banks/modules.
- UltraSPARC T1 (Niagara, 2005), 8 cores / 4 L2 banks: the four L2 banks are interleaved at 64-byte blocks; the cores reach the banks through a crossbar (X-bar), and the memory controllers (M. c.) attach to the banks.
- POWER4 (2001) / POWER5 (2005): the 128-byte L2 cache lines are hashed across the 3 L2 modules; hashing is performed by modulo-3 arithmetic applied to a large number of real address bits. The cores reach the modules via the Fabric Bus Controller (IN), to which the L3 tags/controller, the B. c. and the GX bus also attach.
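For illustration, the two mapping schemes can be written down in a few lines of C. The bank counts, block sizes and line sizes come from the slide; the function names and the exact choice of address bits fed into the modulo-3 hash are assumptions made only for this sketch.

```c
#include <stdint.h>
#include <stdio.h>

/* UltraSPARC T1 style: four L2 banks interleaved at 64-byte blocks,
   so the bank index comes from physical-address bits 7:6. */
static unsigned t1_bank(uint64_t paddr)
{
    return (unsigned)((paddr >> 6) & 0x3);
}

/* POWER4/POWER5 style: 128-byte L2 lines hashed across three modules by
   modulo-3 arithmetic over the real-address bits above the line offset
   (using the whole line address here is an illustrative simplification). */
static unsigned p4_module(uint64_t paddr)
{
    return (unsigned)((paddr >> 7) % 3);
}

int main(void)
{
    const uint64_t addrs[] = { 0x0, 0x40, 0x80, 0x1000, 0x12345680ULL };

    for (size_t i = 0; i < sizeof addrs / sizeof addrs[0]; i++)
        printf("addr 0x%09llx -> T1 bank %u, POWER4/5 module %u\n",
               (unsigned long long)addrs[i], t1_bank(addrs[i]), p4_module(addrs[i]));
    return 0;
}
```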

5. Layout of L2 caches (14). Attaching L2 caches to MCPs: allocation to the cores; inclusion policy; use by instructions/data; banking policy; integration of L2 caches to the proc. chip.

5. Layout of L2 caches (15)
Integration to the processor chip:
- Partial integration of L2 (on-chip L2 tags/control, off-chip data): UltraSPARC IV (2004), UltraSPARC V (2005), PA 8800 (2004), PA 8900 (2005)
- Full integration of L2 (entire L2 on-chip): all other lines considered (expected trend)

6. Layout of L3 caches

6. Layout of L3 caches (1). Macro architecture of multi-core (MC) processors, its main dimensions: layout of the cores; layout of L2 caches; layout of L3 caches (if available); layout of the I/O and memory architecture; layout of the interconnections.

6. Layout of L3 caches (2). Layout of L3 caches: allocation to the L2 cache(s); inclusion policy; use by instructions/data; banking policy; integration of L3 caches to the proc. chip.

6. Layout of L3 caches (3)
Allocation of L3 caches to the L2 caches:
- Private L3 cache for each L2: Montecito (2006)
- Shared L3 cache for all L2s: POWER4 (2001), POWER5 (2005), UltraSPARC IV+ (2005), Athlon 64 X2–Barcelona (2007)

6. Layout of L3 caches (4). Layout of L3 caches: allocation to the L2 cache(s); inclusion policy; use by instructions/data; banking policy; integration of L3 caches to the proc. chip.

6. Layout of L3 caches (5)
Inclusion policy of L3 caches: exclusive L3 (victim cache) or inclusive L3.
Exclusive L3 (victim cache): lines replaced (victimized) in L2 are written to L3. References to data missing in L2 but available in L3 initiate reloading of the pertaining cache line into L2; the reloaded data is deleted from L3. Data missing in both L2 and L3 is fetched from memory directly into L2 (high traffic), while only modified lines replaced in L3 are written back to memory (low traffic).
In both alternatives L3 usually operates as a write-back cache (only modified data that is replaced in L3 is written back to memory); unmodified data that is replaced in L3 is deleted.

6. Layout of L3 caches (6)
Inclusion policy of L3 caches:
- Exclusive L3: POWER5 (2005), UltraSPARC IV+ (2005), Athlon 64 X2–Barcelona (2007), POWER6 (2007)
- Inclusive L3: POWER4 (2001), Montecito (2006)
(Expected trend)

6. Layout of L3 caches (7). Layout of L3 caches: allocation to the L2 cache(s); inclusion policy; use by instructions/data; banking policy; integration of L3 caches to the proc. chip.

6. Layout of L3 caches (8)
Use by instructions/data: unified instr./data cache(s) or split instr./data caches. The L3 caches of all multicore processors unveiled until now are unified, holding both instructions and data.

6. Layout of L3 caches (9). Layout of L3 caches: allocation to the L2 cache(s); inclusion policy; use by instructions/data; banking policy; integration of L3 caches to the proc. chip.

6. Layout of L3 caches (10)
Banking policy: single-banked or multi-banked implementation. The L3 caches of all multicore processors unveiled until now are multi-banked.

6. Layout of L3 caches (11). Layout of L3 caches: allocation to the L2 cache(s); inclusion policy; use by instructions/data; banking policy; integration of L3 caches to the proc. chip.

6. Layout of L3 caches (12)
Integration to the processor chip:
- L3 partially integrated (on-chip L3 tags/control, off-chip data): POWER4 (2001), UltraSPARC IV+ (2005), POWER5 (2005), POWER6 (2007)
- L3 fully integrated (entire L3 on-chip): Montecito (2006) (expected trend)

6. Layout of L3 caches (13)
Examples of partially integrated L3 caches: POWER5 (2005) and UltraSPARC IV+ (2005). In both, the on-chip L3 tags/controller (with off-chip L3 data) attach, together with the M. c. (memory) and the B. c. (Fabric Bus Controller/GX bus or the Fire Plane bus), to the on-chip interconnect behind the shared L2.

6. Layout of L3 caches (14)
Figure 6.1: Cache architecture of MC processors (the figure does not differentiate between the inclusive/exclusive alternatives; inclusivity is denoted in the examples).
The cache-architecture alternatives, each split further into private and shared L2 (and, where present, private or shared L3):
- L2 partially integrated, no L3
- L2 fully integrated, no L3
- L2 fully integrated, L3 partially integrated
- L2 fully integrated, L3 fully integrated
Examples placed in the figure: PA-8800, UltraSPARC IV, Gemini, Montecito, POWER6, Core Duo (Yonah), Core 2 Duo based processors, Cell BE, UltraSPARC T1/T2, SPARC64 VI/VII, XLR, POWER4, POWER5 (excl.), UltraSPARC IV+ (excl.), PA-8900, Smithfield/Presler based processors, Athlon 64 X2 (excl.), Opteron dual core (excl.).

6. Layout of L3 caches (15)
Examples for cache architectures (1):
- L2 partially integrated, private, no L3: UltraSPARC IV (2004). Each core has on-chip L2 tags/controller with off-chip L2 data; the cores attach via the interconnection network to the M. c. (memory) and the B. c. (Fire Plane bus).
- L2 partially integrated, shared, no L3: PA-8800 (2004), PA-8900 (2005). The cores share on-chip L2 tags/controller with off-chip L2 data and attach via the FSB c. to the FSB.

6. Layout of L3 caches (16)
Examples for cache architectures (2):
- L2 fully integrated, private, no L3: Athlon 64 X2 (2005). The cores with private L2s attach via the System Request Queue and the crossbar IN (Xbar) to the on-chip M. c. (memory) and the B. c. (HT bus).
- L2 fully integrated, shared, no L3: UltraSPARC T1 (2005) (cores 0-7 share a banked L2 via a crossbar, with on-chip M. c.s and a B. c. to the JBus); Core Duo (2006), Core 2 Duo (2006) (the cores share the L2 and its controller and attach via the FSB c. to the FSB).

6. Layout of L3 caches (17)
Examples for cache architectures (3): L2 fully integrated, shared, L3 partially integrated.
- UltraSPARC IV+ (2005): the cores share an on-chip L2; the interconnection network connects it to the on-chip L3 tags/controller (with off-chip L3 data), the M. c. (memory) and the B. c. (Fire Plane bus).
- POWER4 (2001): the cores share an on-chip L2 attached to the IN (Fabric Bus Controller), which also connects the on-chip L3 tags/controller (with off-chip L3 data), the off-chip M. c. (memory), the B. c. (GX bus) and the chip-to-chip/module-to-module interconnects.

6. Layout of L3 caches (18)
Examples for cache architectures (4): L2 fully integrated, private, L3 fully integrated.
- Montecito (2006): each core has private split L2 caches (L2I, L2D) and a private on-chip L3; the cores attach via the FSB c. to the FSB.

7. Layout of memory and I/O architecture

7. Layout of memory and I/O architecture (1). Macro architecture of multi-core (MC) processors, its main dimensions: layout of the cores; layout of L2 caches; layout of L3 caches (if available); layout of the I/O and memory architecture; layout of the interconnections.

7. Layout of memory and I/O architecture (2)
Layout of the I/O and memory connection: the concept of connection; the point of attachment; the physical implementation of the M. c.

7. Layout of memory and I/O architecture (3)
The concept of connection:
- Connection via the FSB: used typically in connection with off-chip M. c.s. Examples: Core 2 Duo (shared L2 attached via the FSB c. to the FSB), Montecito (private L2/L3 per core attached via the FSB c. to the FSB).
- Connection via the M. c. and B. c.: used in connection with on-chip M. c.s. Example: UltraSPARC T1 (cores 0-7 and the banked L2 connect through a crossbar to the on-chip M. c.s and the JBus B. c.).

7. Layout of memory and I/O architecture (4)
The point of attachment:
- The highest cache level (via an IN): L2 (private/shared) in case of 2-level caches, or L3 (private/shared) in case of 3-level caches.
- The IN connecting the two highest cache levels: the IN connecting the cores and a shared L2 (2-level caches), or the IN connecting a shared L2 and a shared L3 (3-level caches).

7. Layout of memory and I/O architecture (5)
Cache inclusivity vs. memory attachment:
- With an exclusive L3 (victim cache), data missing in L2/L3 is fetched from memory directly into L2 (high-traffic path), while the L3 only receives replaced (victimized) lines and writes back modified data (low traffic); the M. c. is therefore attached to the IN between L2 and L3. Examples: POWER5 (2004), UltraSPARC IV+ (2004).
- With an inclusive L3, memory traffic flows through the L3, so the M. c. is attached behind the L3. Examples: Montecito (2006), POWER4 (2001).

7. Layout of memory and I/O architecture (6)
The point of attachment (continued):
- The highest cache level (via an IN), i.e. L2 (private/shared) for 2-level caches or L3 (private/shared) for 3-level caches: the M. c. is usually connected in this way if the highest-level cache is inclusive.
- The IN connecting the two highest cache levels, i.e. the IN connecting the cores and a shared L2 (2-level caches) or the IN connecting a shared L2 and a shared L3 (3-level caches): the M. c. is usually connected in this way if the highest-level cache is exclusive.

7. Layout of memory and I/O architecture (7)
Examples for attaching memory and I/O via the highest cache level:
- In case of a two-level cache hierarchy: Athlon 64 X2 (2005) (M. c. and B. c./HT bus attached to the private L2s via the System Request Queue and Xbar), Core 2 Duo (2006) (FSB c. attached to the shared L2).
- In case of a three-level cache hierarchy: Montecito (2006) (FSB c. attached to the private L3s).

7. Layout of memory and I/O architecture (8)
Examples for attaching memory and I/O via the interconnection network connecting the two highest levels of the cache hierarchy:
- In case of a two-level cache hierarchy: UltraSPARC T1 (2005) (the M. c.s and the JBus B. c. attach to the crossbar between the cores and the banked shared L2).
- In case of a three-level cache hierarchy: UltraSPARC IV+ (2005), with an exclusive L3 (the M. c., B. c. and Fire Plane bus attach to the interconnection network between the L2 and the L3 tags/data).

7. Layout of memory and I/O architecture (9)
Integration of the memory controller to the processor chip:
- Off-chip memory controller: longer access times, but provides independence of the memory technology. Examples: POWER4 (2001), PA-8800 (2004), PA-8900 (2005), Smithfield (2005), Presler (2005), Yonah Duo (2006), Core (2006), Montecito (2006?).
- On-chip memory controller: shortens access times, but causes dependency on memory technology and speed (expected trend). Examples: UltraSPARC IV (2004), UltraSPARC IV+ (2005), UltraSPARC T1 (2005), POWER5 (2005), Athlon 64 X2 (2005).
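The access-time argument can be made concrete with a back-of-the-envelope calculation; all latency figures below are hypothetical placeholders chosen only to illustrate the extra chipset hop, not measured values for any of the processors above.

```c
#include <stdio.h>

/* Rough illustration of the trade-off: an off-chip memory controller adds an
   FSB/northbridge hop to every memory access.  All numbers are hypothetical. */
int main(void)
{
    const double dram_ns    = 50.0;   /* assumed DRAM array access time        */
    const double ctrl_ns    = 10.0;   /* assumed controller overhead           */
    const double fsb_hop_ns = 30.0;   /* assumed extra FSB + northbridge delay */

    printf("on-chip  M. c.: ~%.0f ns per memory access\n", ctrl_ns + dram_ns);
    printf("off-chip M. c.: ~%.0f ns per memory access\n",
           fsb_hop_ns + ctrl_ns + dram_ns);
    return 0;
}
```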

8. Layout of the on-chip interconnections

8. Layout of interconnections (1). Macro architecture of multi-core (MC) processors, its main dimensions: layout of the cores; layout of L2 caches; layout of L3 caches (if available); layout of the I/O and memory architecture; layout of the interconnections.

8. Layout of interconnections (2)
On-chip interconnections are needed:
- between the cores and shared L2 modules,
- between private or shared L2 modules and shared L3 modules,
- between private or shared L2/L3 modules and the B. c./M. c., or alternatively the FSB c.
Beyond the chip there are chip-to-chip and module-to-module interconnects (towards memory, the I/O bus and the FSB).

8. Layout of interconnections (3)
Implementation of interconnections:
- Arbiter/multi-port cache implementations: for a small number of sources/destinations, e.g. to connect dual cores to shared L2 caches.
- Crossbar: for a larger number of sources/destinations, e.g. UltraSPARC T1 (2005), UltraSPARC T2 (2007).
- Ring: e.g. Cell BE (2006), XLR (2005).
Quantitative aspects, such as the number of sources/destinations or the bandwidth requirement, determine which implementation alternative is the most beneficial.
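A crude way to see why the number of sources/destinations matters is to compare how the wiring cost of each topology grows; the sketch below only counts crosspoints and links under that simplifying assumption and is not a model of any of the chips named above.

```c
#include <stdio.h>

/* A full crossbar needs on the order of n*n crosspoints for n agents,
   while a ring needs only n links, which is why rings become attractive
   as the number of sources/destinations grows. */
int main(void)
{
    for (int n = 2; n <= 16; n *= 2)
        printf("n = %2d agents: crossbar ~%3d crosspoints, ring %2d links\n",
               n, n * n, n);
    return 0;
}
```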

9. Basic alternatives of macro-architectures of MC processors

9. Basic alternatives of macro-architectures of MC processors (1)
Basic dimensions spanning the design space:
- the kind of the related cache architecture,
- the layout of the memory and I/O connection,
- the layout of the IN(s) involved.
Interrelationships exist between particular dimensions: e.g. cache inclusivity affects the layout of the memory connection.

9. Basic alternatives of macro-architectures of MC processors (2)
Considering a selected part of the design space of the macroarchitecture of MC processors. Assumptions:
- A two-level cache hierarchy is considered (no L3).
- L2 may be only partially integrated on the chip (i.e. only the tags are on the chip); this is not indicated in the design-space tree but is denoted in the examples.
- We do not consider how the involved INs are implemented.
- In case of an off-chip M. controller the use of an FSB is assumed.

9. Basic alternatives of macro-architectures of MC processors (3)
Design space of the macroarchitecture of MC processors (in case of a two-level cache hierarchy); the part of the design space specifying the point of attachment for the memory and I/O:
- Off-chip M. c.: L2 private or L2 shared
- On-chip M. c.: L2 private or L2 shared

9. Basic alternatives of macro-architectures of MC processors (4)
Design space of the macroarchitecture of MC processors (2-level hierarchy) (1): M. c. off-chip (cores with private L2s, or a shared L2, attached via the FSB c. to the FSB):
- L2 private: Smithfield/Presler based processors (2005/2006) (2 cores)
- L2 shared: Core 2 Duo based processors (2006) (2 cores), SPARC64 VI (2007) (2 cores), SPARC64 VII (2008) (4 cores), PA-8800 (2004) (2 cores, L2 data off-chip)

9. Basic alternatives of macro-architectures of MC processors (5)
Design space of the macroarchitecture of MC processors (2-level hierarchy) (2): M. c. on-chip, with L2 private or L2 shared. Alternatives for attaching the memory and I/O:
- attaching both the I/O and the M. c. to L2 via IN2,
- attaching the M. c. to L2 via IN2 and the I/O to IN1,
- attaching the I/O to L2 via IN2 and the M. c. to IN1,
- attaching both the I/O and the M. c. via IN1.
Examples: UltraSPARC T1 (2005) (8 cores), Cell BE (2006) (8 cores), Opteron dual core (2005) (2 cores), UltraSPARC IV (2004) (2 cores, L2 data off-chip), Athlon 64 X2 (2005) (2 cores).