Multicore Architectures
Michael Gerndt
Development of Microprocessors
Transistor capacity doubles every 18 months (chart © Intel)
Development of Microprocessors
Moore's Law
–expected to hold for at least the next 10 years
–but: transistor count ≠ CPU power
How to use the transistor resources?
Better execution core
–enhance pipelining, superscalarity, …
–better vector processing (SIMD, like MMX/SSE)
–problem: gap to memory speed
Larger caches
–improve memory access speed
More execution cores
–problem: gap to memory speed
…
Development of Microprocessors
Objective for manufacturers
–as much profit as possible: sell processors …
–customers only buy when applications run faster
–therefore: increase CPU power
How to increase CPU power
–higher clock rate
–more parallelism: Instruction Level Parallelism (ILP) and Thread Level Parallelism (TLP)
Development of Microprocessors
Higher clock rates
increase power consumption
–proportional to f and U²
–higher frequency needs higher voltage
–small structures: energy loss through leakage
increase heat output and cooling requirements
limit chip size (speed of light)
at fixed technology (e.g. 60 nm)
–fewer transistor levels per pipeline stage possible
–more, simplified pipeline stages (P4: >30 stages)
–higher penalty for pipeline stalls (on conflicts, e.g. branch misprediction)
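As a first-order sketch (the standard CMOS dynamic-power model, not spelled out on the slide): P_dyn ≈ α · C · U² · f, where α is the switching activity, C the switched capacitance, U the supply voltage and f the clock frequency. Because a higher f generally requires a higher U, power grows faster than linearly with frequency.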
Development of Microprocessors
More parallelism
Increased bit width (now: 64-bit architectures)
–SIMD
Instruction Level Parallelism (ILP)
–exploits parallelism found in an instruction stream
–limited by data/control dependencies
–can be increased by speculation
–average ILP in typical programs: 6-7
–modern superscalar processors cannot get much better than this
Development of Microprocessors
More parallelism
Thread Level Parallelism (TLP)
–hardware multithreading (e.g. SMT: Hyperthreading)
–better exploitation of superscalar execution units
–multiple cores
–legacy software must be parallelized
–a challenge for the whole software industry
–Intel moved into the tools business
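A minimal sketch of thread-level parallelism in application code (illustration only, not part of the slides), using OpenMP in C; the array names and sizes are arbitrary:

#include <stdio.h>
#include <omp.h>

/* Minimal TLP illustration: independent loop iterations are distributed
 * across the cores / hardware threads of a multicore CPU. */
int main(void) {
    enum { N = 1000000 };
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    #pragma omp parallel for          /* run iterations on multiple threads */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("threads available: %d, c[42] = %f\n", omp_get_max_threads(), c[42]);
    return 0;
}

Compiled with an OpenMP-enabled compiler (e.g. gcc -fopenmp), the loop runs on all available cores; legacy serial code only benefits from additional cores after such a parallelization step.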
Multicore Architectures
SMPs on a single chip: Chip Multi-Processors (CMP)
Advantages
Efficient exploitation of the available transistor budget
Improves throughput and speed of parallelized applications
Allows tight coupling of cores
–better communication between cores than in an SMP
–shared caches
Low power consumption
–low clock rates
–idle cores can be suspended
Disadvantages
Only improves the speed of parallelized applications
Increased gap to memory speed
Multicore Architectures
Design decisions
Homogeneous vs. heterogeneous
–specialized accelerator cores: SIMD, GPU operations, cryptography, DSP functions (e.g. FFT), FPGA (programmable circuits)
Access to memory
–own memory area (distributed memory)
–via cache hierarchy (shared memory)
Connection of cores
–internal bus / crossbar connection
–cache architecture
Multicore Architectures: Examples
[Block diagrams]
–Homogeneous design with shared caches and crossbar (cores with L1/L2/L3, two memory modules, I/O)
–Heterogeneous design with caches, local stores and ring bus (SMT cores with L1/L2, cores with local stores, memory module, I/O)
Shared Cache Design
[Block diagrams]
–Traditional design: multiple single-core processors with the shared cache off-chip
–Multicore architecture: shared caches on-chip
Shared Cache Design
[Block diagram]
–Multicore architecture: shared caches on-chip (cores with private L1, shared L2, one memory interface)
Shared Caches: Advantages
No coherence protocol needed at the shared cache level
Lower communication latency
Processors with overlapping working sets
–one processor may prefetch data for the other
–smaller cache size needed
–better usage of loaded cache lines before eviction (spatial locality)
–less congestion on the limited memory connection
Dynamic sharing
–if one processor needs less space, the other can use more
Avoidance of false sharing
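The false-sharing point can be made concrete with a small sketch (illustration only, not from the slides): two threads update counters that happen to share a cache line. With private per-core caches the line ping-pongs between the cores on every write; a shared cache removes that penalty (the usual software fix is to pad each counter onto its own line).

#include <stdio.h>
#include <omp.h>

#define ITER 100000000L

/* The two counters lie in the same cache line. */
static struct { long a; long b; } counters;

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        for (long i = 0; i < ITER; i++) counters.a++;   /* runs on one core */
        #pragma omp section
        for (long i = 0; i < ITER; i++) counters.b++;   /* runs on another core */
    }
    printf("a = %ld, b = %ld\n", counters.a, counters.b);
    return 0;
}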
Shared Caches: Disadvantages
Multiple CPUs -> higher requirements -> higher bandwidth needed
Cache should be larger (larger -> higher latency)
Hit latency higher due to the switch logic above the cache
More complex design
One CPU can evict data of the other CPU
Multicore Processors: SUN
UltraSparc IV / IV+
–dual core
–2x multithreaded per core
UltraSparc T1 (Niagara)
–8 cores
–4x multithreaded per core
–one FPU shared by all cores
–low power
UltraSparc T2 (Niagara 2)
Intel Itanium 2 Dual Core - Montecito
Two Itanium 2 cores
Multi-threading (2 threads)
–simultaneous multi-threading for memory hierarchy resources
–temporal multi-threading for core resources
–besides the end of a time slice, an event, typically an L3 cache miss, can trigger a thread switch
Caches
–L1D 16 KB, L1I 16 KB
–L2D 256 KB, L2I 1 MB
–L3 9 MB
–caches private to the cores
1.7 billion transistors
Itanium 2 Dual Core
Intel Core Duo
2 mobile-optimized execution cores
No multi-threading
Cache hierarchy
–private 32-KB L1I and L1D per core
–shared 2 MB L2 cache: provides efficient data sharing between both cores
Power reduction
–some sleep states can be entered individually by each core
–Deeper Sleep and Enhanced Deeper Sleep states only for the whole die
–Dynamic Cache Sizing feature: flushes the entire cache, which enables Enhanced Deeper Sleep at a lower voltage that does not guarantee cache integrity
151 million transistors
IBM Cell
Developed by IBM, Sony and Toshiba
Playstation 3 (Q1 2006)
256 GFlops
Only ~30 W at 3 GHz
The whole PS3 costs only $300-400
http://www-128.ibm.com/developerworks/power/library/pa-cellperf
Cell: Architecture
9 parallel processors, specialized for different tasks
–1 large PPE
–8 SPEs (Synergistic Processing Elements)
Cell: SPE
Synergistic Processing Element
–128 registers, 128-bit SIMD
–single thread
–256 KByte local memory, not a cache
–DMA engine executes memory transfers
–simple ISA: less functionality to save space
–these limitations can become a problem if memory access is too slow
25.6 GFlops single precision for multiply-add operations
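A rough sketch of the programming pattern this implies (not from the slides; it assumes the SPU-side MFC intrinsics from the Cell SDK header spu_mfcio.h, and the buffer size, tag and function name are illustrative): because the local store is not a cache, data must be pulled in explicitly by DMA before the SPE can compute on it.

#include <spu_mfcio.h>

#define CHUNK 4096
static float buf[CHUNK / sizeof(float)] __attribute__((aligned(128)));

/* Hypothetical helper: fetch CHUNK bytes from main memory (effective
 * address ea) into the local store, wait for the DMA, then compute. */
void process_chunk(unsigned long long ea)
{
    unsigned int tag = 1;

    mfc_get(buf, ea, CHUNK, tag, 0, 0);   /* start DMA: main memory -> local store */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();            /* block until the transfer completes */

    for (unsigned int i = 0; i < CHUNK / sizeof(float); i++)
        buf[i] *= 2.0f;                   /* data now lives in the 256 KB local store */
}

In practice such transfers are double-buffered so that DMA and computation overlap; if memory access is too slow relative to the compute, the SPE stalls exactly as the slide warns.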
Intel Westmere EX
Processor of the fat node of SuperMUC @ LRZ: Xeon E7-4870 (2.4 GHz, 10 cores, 30 MByte L3)
2.4 GHz: 9.6 Gflop/s per core (4 flops per cycle), 96 Gflop/s per socket (10 cores)
10 hyperthreaded cores, i.e. two logical cores each
Caches
–32 KB L1, private
–256 KB L2, private
–30 MB L3, shared
2.9 billion transistors
NUMA
On-chip NUMA
–L3 cache organized in 10 slices
–interconnection via a bidirectional ring bus
–10-way physical address hashing to avoid hot spots; can handle five parallel cache requests per clock cycle
–the mapping algorithm is not documented; no migration support
Off-chip NUMA
–glueless combination of up to 8 sockets into an SMP
–4 QuickPath Interconnect (QPI) interfaces
–2 on-chip memory controllers
Cache Coherency
Cbox
–connects a core to the ring bus and to one bank of the L3 cache
–responsible for processor read/write/writeback requests and external snoops, and for returning cached data to the core and to the QuickPath agents
–the distribution of physical addresses across the Cboxes is determined by a hash function
Sbox
–caching agent
–each Sbox is associated with 5 Cboxes
Cache Coherency
Bbox
–home agent
–responsible for the cache coherency of the cache lines in its memory
–keeps track of the Cbox replies to coherence messages
Directory Assisted Snoopy (DAS)
–keeps a state per cache line (I – idle, no remote sharers; R – may be present on a remote socket; E/D – owned by the I/O Hub)
–if a line is in the I state it can be forwarded without waiting for snoop replies
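A tiny sketch of the decision the home agent makes (illustration only; the state names follow the slide, everything else is hypothetical):

/* Per-cache-line directory state kept by the home agent (after the slide). */
enum das_state {
    DAS_I,    /* idle: no remote sharers           */
    DAS_R,    /* may be present on a remote socket */
    DAS_ED    /* owned by the I/O hub              */
};

/* A line in state I can be returned from memory immediately;
 * otherwise the home agent must wait for snoop replies. */
int forward_without_snoop(enum das_state s)
{
    return s == DAS_I;
}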
Summary
High frequency -> high power consumption
Trend towards multiple cores on a chip
Broad spectrum of designs: homogeneous, heterogeneous, specialized, general purpose, number of cores, cache architectures, local memories, simultaneous multithreading, …
Problem: memory latency and bandwidth