1
Dezső Sima Multicore and Manycore Processors December 2008 Overview and Trends
2
Overview
1. Overview
2. Homogeneous multicore processors
   2.1 Conventional multicores
   2.2 Manycore processors
3. Heterogeneous multicore processors
   3.1 Master/slave architectures
   3.2 Attached processor architectures
4. Outlook
3
1. Overview – inevitability of multicores
4
Figure: Evolution of Intel’s IC fab technology [1] 1. Overview – inevitability of multicores (1) Shrinking: ~ 0.7/2 Years
5
1. Overview – inevitability of multicores (2) IC fab technology: shrinking by ~0.7x every two years puts the same number of transistors on half the Si die area, i.e. on the same die area twice as many transistors every two years. Moore's rule (2nd formulation, from 1975): doubling of transistor counts on chips roughly every two years.
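As a quick check of the arithmetic linking the two formulations: a linear shrink of ~0.7x corresponds to an area factor of 0.7² ≈ 0.49, so the same circuit needs about half the die area, or equivalently the same die area holds roughly twice as many transistors:
$$A_{new} = (0.7)^2 \, A_{old} \approx 0.5\,A_{old} \quad\Rightarrow\quad N_{new} \approx 2\,N_{old} \ \text{every two years}$$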
6
Doubling transistor counts ~every two years: utilization of the surplus transistors? Wider processor width: pipelined (1-wide) → 1st-gen. superscalar (2-wide) → 2nd-gen. superscalar (4-wide). 1. Overview – inevitability of multicores (3)
7
1. Overview – inevitability of multicores (4) Figure: Parallelism available in applications [2] Available parallelism in general purpose apps: ~ 4-5
8
Doubling transistor counts ~every two years: utilization of the surplus transistors? Wider processor width (pipelined → 1st-gen. superscalar (2-wide) → 2nd-gen. superscalar (4-wide)), core enhancements (branch prediction, speculative loads, ...), cache enhancements (L2/L3 size, associativity, ...). 1. Overview – inevitability of multicores (5)
9
Increasing transistor counts spent on a single core bring diminishing returns in performance. The inevitability of multicore processors: the best use of surplus transistors is multiple cores, with doubling of core numbers ~every two years. 1. Overview – inevitability of multicores (6)
10
Figure: Spread of Intel's multicore processors [3] 1. Overview – inevitability of multicores (7)
11
1. Overview – inevitability of multicores (8) Figure 1.1: Main classes of multicore/manycore processors. Homogeneous multicores: conventional multicores (2 ≤ n ≤ 8 cores; general-purpose computing on desktops and servers; production stage) and manycore processors (>8 cores; prototypes/experimental systems; HPC; near future). Heterogeneous multicores: master/slave architectures (e.g. MPC; MM/3D/HPC) and attached (add-on) architectures (CPU + GPU; HPC).
12
2. Homogeneous multicores 2.1 Conventional multicores 2.2 Manycore processors
13
2. Homogeneous multicores. Figure 2.1: Main classes of multicore/manycore processors (same classification as Figure 1.1: homogeneous multicores = conventional multicores with 2 ≤ n ≤ 8 cores and manycore processors with >8 cores; heterogeneous multicores = master/slave and attached (add-on) architectures).
14
2.1 Conventional multicores Multicore MP servers Intel’s multicore MP servers AMD’s multicore MP servers
15
2.1 Intel's multicore MP servers (1) Figure 2.1.1: Intel's Tick-Tock development model [13] The evolution of Intel's basic microarchitecture
16
Figure 2.1.2: Overview of Intel ’ s Tick-Tock model and the related MP servers [24] 11/2005: First DC MP Xeon 1Q/2009 7100 (Tulsa) 7300 (Tigerton QC) 7400 (Dunnington) 7xxx (Beckton) (Potomac) 7000 (Paxville MP) (Cransfield) 7200 (Tigerton DC) 2x1 C 1 MB L2/C 16 MB L3 2x2 C 4 MB L2/C 1x6 C 3 MB L2/2C 16 MB L3 1x8 C ¼ MB L2/C 24 MB L3 1x1 C 8 MB L2 2x1 C ½ MB L2/C 1x1 C 1 MB L2 1x2 C 4 MB L2/C 3/2005: First 64-bit MP Xeons 90nm TICK Pentium 4 /Prescott) TOCK Pentium 4 /Irwindale) 2.1 Intel’s multicore MP servers (2) Intel’s Tick-Tock model for MP servers
17
Figure 2.1.3: Evolution of Intel's Xeon MP-based system architecture (until the appearance of Nehalem) – system architecture before Potomac: SC Xeon MPs¹ attached to the preceding NBs, typically via HI 1.5 (266 MB/s). ¹ Xeon MPs before Potomac. 2.1 Intel's multicore MP servers (3)
18
Figure 2.1.4: Intel's Xeon-based MP server platforms – the Truland platform (P4-based cores).
MP chipsets: 8500 (Twin Castle), 3/2005: 2x FSB 667 MT/s, 4 x XMB (2 x DDR2 each), up to 32 GB; 8501 (?), 4/2006: 2x FSB 800 MT/s, 4 x XMB (2 x DDR2 each), up to 32 GB.
MP cores:
- Xeon MP (Potomac, SC), 3/2005: P4-based, 90 nm/675 mtrs, 1 MB L2, 8/4 MB L3, 667 MT/s, mPGA 604 – first 64-bit MP server processor
- Xeon 7000 (Paxville MP, DC), 11/2005: P4-based, 90 nm/2x169 mtrs, 2x1 (2) MB L2, no L3, 800/667 MT/s, mPGA 604
- Xeon 7100 (Tulsa, DC), 8/2006: P4-based, 65 nm/1328 mtrs, 2x1 MB L2, 16/8/4 MB L3, 800/667 MT/s, mPGA 604
2.1 Intel's multicore MP servers (4)
19
Figure 2.1.5: Evolution of Intel's Xeon MP-based system architecture (until the appearance of Nehalem).
Up to 2005: SC Xeon MPs¹ on the preceding NBs, typically via HI 1.5 (266 MB/s).
2005: the Truland platform – Potomac² (SC) / Paxville MP³ (DC) processors on the 8500/8501 (Twin Castle) NB; memory attached via serial links to 4 x XMB (External Memory Bridge); 28 PCIe lanes + HI 1.5 towards the I/O.
¹ Xeon MPs before Potomac. ² First x86-64 MP processor. ³ The 8500 also supports Cransfield (SC) and Tulsa (DC).
2.1 Intel's multicore MP servers (5)
20
Figure 2.1.6: Intel's Xeon-based MP server platforms – the Truland and Caneland platforms.
MP chipsets: Truland – 8500 (Twin Castle), 3/2005 and 8501 (?), 4/2006: 2x FSB 667/800 MT/s, 4 x XMB (2 x DDR2 each), up to 32 GB; Caneland – 7300 (Clarksboro), 9/2007: 4x FSB 1066 MT/s, 4 x FBDIMM (DDR2) channels, up to 512 GB.
MP cores:
- Xeon MP (Potomac, SC), 3/2005: P4-based, 90 nm/675 mtrs, 1 MB L2, 8/4 MB L3, 667 MT/s, mPGA 604
- Xeon 7000 (Paxville MP, DC), 11/2005: P4-based, 90 nm/2x169 mtrs, 2x1 (2) MB L2, no L3, 800/667 MT/s, mPGA 604
- Xeon 7100 (Tulsa, DC), 8/2006: P4-based, 65 nm/1328 mtrs, 2x1 MB L2, 16/8/4 MB L3, 800/667 MT/s, mPGA 604
- Xeon 7200 (Tigerton DC) and 7300 (Tigerton QC), 9/2007: Core2-based, 65 nm/2x291 mtrs, 2x4 or 2x(4/3/2) MB L2, no L3, 1066 MT/s, mPGA 604
- Xeon 7400 (Dunnington, 6C), 9/2008: Core2-based, 45 nm/1900 mtrs, 9/6 MB L2, 16/12/8 MB L3, 1066 MT/s, mPGA 604
2.1 Intel's multicore MP servers (6)
21
Figure 2.1.7: Evolution of Intel's Xeon MP-based system architecture (until the appearance of Nehalem).
Up to 2005: SC Xeon MPs¹ on the preceding NBs, typically via HI 1.5 (266 MB/s).
2005: Truland – Potomac² (SC) / Paxville MP³ (DC) on the 8500/8501 (Twin Castle) NB with 4 x XMB, 28 PCIe lanes + HI 1.5.
2007: Caneland – Tigerton (DC/QC) / Dunnington (6C) on the 7300 (Clarksboro) NB, one FSB per socket, FB-DIMM (DDR2) memory, PCI-E lanes + ESI.
¹ Xeon MP before Potomac. ² First x86-64 MP processor. ³ The 8500 also supports Cransfield (SC) and Tulsa (DC).
2.1 Intel's multicore MP servers (7)
22
2.1 Intel's multicore MP servers (8) Figure 2.1.8: Nehalem's key innovations concerning the system architecture [22] Nehalem's key innovations concerning the system architecture (11/2008)
23
2.1 Intel's multicore MP servers (9) Figure 2.1.9: Nehalem's key innovations concerning the system architecture [22] Nehalem's key innovations concerning the system architecture (11/2008)
24
2.1 Intel's multicore MP servers (10) – 11/2008: Nehalem. Figure 2.1.10: Intel's Nehalem-based MP server architecture: Beckton 8C processors interconnected by QPI (QuickPath Interconnect) links, with 4x FB-DIMM memory channels per processor.
25
AMD’s multicore MP servers
26
2.1 AMD's multicore MP servers (1) AMD Direct Connect Architecture (2003): integrated memory controller and serial HyperTransport links. Figure 2.1.11: AMD's Direct Connect Architecture [14]. Remark: 3 HT 1.0 links at introduction (K8), 4 HT 3.0 links with K10 (Barcelona). Introduced in 2003 along with the x86-64 ISA extension; Intel moved to an integrated memory controller and serial links only in 2008 with Nehalem.
27
2.1 AMD's multicore MP servers (2) Use of available HyperTransport links [44]:
- UPs: each link supports connections to I/O devices.
- DPs: two links support connections to I/O devices; any one of the three links may connect to another DP or MP processor.
- MPs: each link supports connections to I/O devices or to other DP or MP processors.
28
Figure 2.1.12: 2P and 4P server architectures based on AMD's Direct Connect Architecture [15], [16]: Opterons connected to each other by HT links, with DDR2 memory attached directly to each processor and PCI/PCI-X/PCI Express I/O attached via HT. 2.1 AMD's multicore MP servers (3)
29
Figure 2.1.13: Block diagram of Barcelona (K10) vs the K8 [17] 2.1 AMD's multicore MP servers (4)
30
Figure 2.1.14: Possible use of Barcelona's four HT 3.0 links [39] 2.1 AMD's multicore MP servers (5)
31
Current platforms (2nd-gen. Socket F with the available chipsets) do not support HT 3.0 links [46]. Novel features of HT 3.0 links, such as higher speed or splitting a 16-bit HT link into two 8-bit links, can therefore be utilized only with a new platform. 2.1 AMD's multicore MP servers (6)
32
Figure 2.1.15: AMD's roadmap for server processors and platforms [19] 2.1 AMD's multicore MP servers (7)
33
2.2 Manycore processors
34
Desktops Heterogenous multicores Homogenous multicores Multicore processors Manycore processors Servers with >8 cores Conventional multicores Master/slave architectures Add-on architectures MPC CPU GPU 2 ≤ n ≤ 8 cores General purpose computing Prototypes/ experimental systems MM/3D/HPC production stage HPC near future 2.2 Manycore processors Figure 2.2.1: Main classes of multicore/manycore processors
35
2.2 Manycore processors Intel’s Larrabee Intel’s Tiled processor
36
Larrabee – part of Intel's Tera-Scale Initiative.
Brief history: project started ~2005; first unofficial public presentation 03/2006 (withdrawn); first brief public presentation 09/2007 (Otellini) [29]; first official public presentations in 2008 (e.g. at SIGGRAPH [27]); due in ~2009.
Objectives: not a single product but a base architecture for a number of different products; high-end graphics processing, HPC.
Targeted performance: 2 TFLOPS.
2.2 Intel's Larrabee (1)
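The 2 TFLOPS target is consistent with a simple peak-rate calculation; the core count and clock below are illustrative assumptions (they were not officially disclosed at the time), not Intel figures:
$$P_{peak} \approx 32~\text{cores} \times 16~\text{lanes} \times 2~\tfrac{\text{FLOP}}{\text{lane}\cdot\text{cycle}}~(\text{multiply-add}) \times 2~\text{GHz} \approx 2~\text{TFLOPS (SP)}$$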
37
Figure 2.2.2: Block diagram of the Larrabee [4]. Basic architecture – cores: in-order, 4-way multithreaded x86 IA cores, augmented with a SIMD-16 capability; L2 cache: fully coherent; ring bus: 1024 bits wide. 2.2 Intel's Larrabee (2)
38
Figure 2.2.5: Larrabee vs the Pentium [11]. Main extensions: 64-bit instructions; 4-way multithreading (with 4 register sets); addition of a 16-wide (16x32-bit) VU; increased L1 caches (32 KB vs 8 KB); access to its 256 KB local subset of a coherent L2 cache; a ring network to access the coherent L2 cache and allow inter-processor communication. 2.2 Intel's Larrabee (3)
39
Figure 2.2.3: Block diagram of the Vector Unit [5]. The Vector Unit (VU):
- Scatter-gather instructions: load a VU vector register from 16 non-contiguous data locations anywhere in the on-die L1 cache without penalty, or store a VU register similarly.
- Numeric conversions: 8-bit and 16-bit integer and 16-bit FP data can be read from or written to the L1 cache, with conversion to 32-bit integers, without penalty.
- The L1 D$ thus acts as an extension of the register file.
- Mask registers have one bit per vector lane, to control which lanes of a vector register or memory operand are read or written and which remain untouched.
2.2 Intel's Larrabee (4)
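As a rough illustration of how per-lane masking and gather behave, here is a conceptual scalar emulation in C; it is not Larrabee's actual instruction set, and all names are made up for the sketch:

```c
#include <stdint.h>

#define LANES 16

/* Masked vector add: only lanes whose mask bit is set are written;
   the other lanes of dst remain untouched. */
static void vadd_masked(float dst[LANES], const float a[LANES],
                        const float b[LANES], uint16_t mask)
{
    for (int lane = 0; lane < LANES; lane++)
        if (mask & (1u << lane))
            dst[lane] = a[lane] + b[lane];
}

/* Gather: fill a vector register from 16 non-contiguous locations,
   addressed by a base pointer plus per-lane indices. */
static void vgather(float dst[LANES], const float *base,
                    const int32_t idx[LANES], uint16_t mask)
{
    for (int lane = 0; lane < LANES; lane++)
        if (mask & (1u << lane))
            dst[lane] = base[idx[lane]];
}
```

The key point the mask captures is predication: data-dependent control flow over a 16-wide vector is expressed by switching lanes on and off rather than by branching per element.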
40
Figure 2.2.4: Layout of the 16-wide vector ALU [5]. ALUs: execute integer, SP and DP FP instructions; multiply-add instructions are available. 2.2 Intel's Larrabee (5)
41
Figure 2.2.6: System architecture of a Larrabee-based 4-processor MP server [6]. 2.2 Intel's Larrabee (6). CSI: Common System Interface (serial, packet-based bus).
42
2.2 Intel's Larrabee (7) Programming of Larrabee [5]: Larrabee has x86 cores with an unspecified ISA extension.
43
2.2 Intel's Larrabee (8) Figure 2.2.7: Intel's ISA extensions [11]. AES: Advanced Encryption Standard; AVX: Advanced Vector Extensions; FMA: FP fused multiply-add instructions supporting 256-bit/128-bit SIMD.
44
2.2 Intel's Larrabee (9) Programming of Larrabee [5]: Larrabee has x86 cores with an unspecified ISA extension; the x86 cores allow Larrabee to be programmed like usual x86 processors, using enhanced C/C++ compilers from MS, Intel, GCC etc.; this is a huge advantage compared to the competition (Nvidia, AMD/ATI).
45
Intel’s Tiled processor
46
Tiled Processor – first implementation of Intel's Tera-Scale Initiative (among more than 100 projects).
Aim: Tera-Scale research chip – high-bandwidth interconnect, energy management, programming manycore processors.
Milestones of the development: announced at IDF Fall 2006 (9/2006); details at ISSCC 2007 (2/2007); due 2009/2010.
Remark: based on ideas of the Raw processor (MIT).
2.2 Intel's Tiled processor (1)
47
Figure 2.2.8: Basic structure of the Tiled Processor [7] 2.2 Intel's Tiled processor (2)
48
Figure 2.2.9: Block diagram of a tile [7], [9]: each tile holds 2 single-precision FP (multiply-add) cores (a VLIW microarchitecture?) and a router; the figure also marks a unit used for debugging. 2.2 Intel's Tiled processor (3)
49
2.2 Intel's Tiled processor (4) Figure 2.2.10: Die shot of the Tiled Processor [8]
50
Figure 2.2.13: Ring-based interconnect network topology [7] 2.2 Intel's Tiled processor (5)
51
Figure 2.2.14: Mesh interconnect topology [7] 2.2 Intel's Tiled processor (6)
52
Figure 2.2.11: Integration of dedicated hardware units (accelerators) [7] 2.2 Intel's Tiled processor (7)
53
2.2 Intel's Tiled processor (8) Figure 2.2.12: Sleeping (inactivated) cores [7]
54
2.2 Intel's Tiled processor (9) Figure 2.2.15: Performance figures of the Tiled Processor [7] – matrix multiplication (single precision). Peak performance: 4 SP FP/cycle; at 4 GHz: 1.6 TFLOPS.
55
3. Heterogeneous multicores 3.1 Master/slave architectures 3.2 Attached architectures
56
3. Heterogeneous multicores. Figure 3.1: Main classes of multicore processors (same classification as Figure 1.1: homogeneous multicores = conventional MC processors and manycore processors; heterogeneous multicores = master/slave and attached (add-on) architectures).
57
3.1 Master/slave architectures The Cell BE
58
3.1 The Cell BE (1) Computational model: a master/slave computational model with cacheless private memory spaces (LSs).
- The master/slave model allows delegating tasks to dedicated, task-efficient units; it needs efficient mechanisms for transferring the tasks (programs and data) from the master to the slaves and the results back from the slaves to the master, for synchronization between the master and the slaves, and for inter-core communication and synchronization.
- Cacheless private memory spaces allow efficient utilization of the die area for computations; they need an efficient LS-based microarchitecture for the slaves.
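A minimal sketch of the master-side offload loop this model implies, written in C with hypothetical send/wait primitives (the real Cell SDK uses libspe contexts, DMA and mailboxes; every name below is illustrative only, not IBM's API):

```c
/* Hypothetical master-side offload for a master/slave multicore:
   the master partitions the work, sends each slave a task descriptor
   (pointers to code/data), then synchronizes on completion. */
typedef struct { const float *in; float *out; int n; } task_t;

extern int  slave_count(void);                   /* hypothetical */
extern void slave_send_task(int id, task_t *t);  /* hypothetical: DMA/mailbox */
extern void slave_wait_done(int id);             /* hypothetical: sync */

void run_on_slaves(const float *in, float *out, int n)
{
    int s = slave_count();
    if (s > 16) s = 16;                /* sketch: fixed-size descriptor array */
    task_t tasks[16];
    int chunk = (n + s - 1) / s;

    for (int i = 0; i < s; i++) {      /* delegate tasks to the slaves */
        int off = i * chunk;
        int len = (off < n) ? ((off + chunk <= n) ? chunk : n - off) : 0;
        tasks[i] = (task_t){ in + off, out + off, len };
        slave_send_task(i, &tasks[i]);
    }
    for (int i = 0; i < s; i++)        /* collect completion notifications */
        slave_wait_done(i);
}
```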
59
Performance @ 3.2 GHz (QS21 blade): peak SP FP performance 409.6 GFLOPS (3.2 GHz x 2x8 SPE x 2x4 SP FP/cycle), worked out below. 3.1 The Cell BE (2)
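Spelled out, the QS21 peak figure follows directly from the two Cell chips per blade (2x8 SPEs) and the per-SPE 4-wide SP multiply-add (8 FLOP/cycle):
$$P_{peak} = 3.2~\text{GHz} \times (2 \times 8)~\text{SPEs} \times (2 \times 4)~\tfrac{\text{SP FLOP}}{\text{cycle}} = 409.6~\text{GFLOPS}$$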
60
3.1 The Cell BE (3) Figure 3.1.2: Cell roadmap from 2007 [22]
61
3.2 Attached architectures
62
Figure 3.2.1: Main classes of multicore/manycore processors (same classification as Figure 1.1). 3.2 Attached architectures
63
Introduction to GPGPUs The SIMT computational model (CM) Recent implementations of the SIMT CM Intel’s future processors with attached architecture AMD’s future processors with attached architecture
64
Introduction to GPGPUs
65
Figure 3.2.2: Evolution of the microarchitecture of GPUs [23] 3.2 Introduction to GPGPUs (1) Evolution of the microarchitecture of GPUs
66
Figure 3.2.3: Simplified block diagram of AMD/ATI’s RV770 [24] 160 cores x 5 execution units 3.2 Introduction to GPGPUs (2)
67
Figure 3.2.4: Simplified structure of a core of the RV770 GPGPU [24] Execution units (Stream Processing Units) 32-bit FP (ADD, MUL, MADD) 64-bit FP 32-bit FX. 3.2 Introduction to GPGPUs (3)
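With 160 cores x 5 stream processing units = 800 ALUs, each able to issue a single-precision multiply-add (2 FLOP) per cycle, the peak rate follows from the clock; the ~750 MHz figure below is the HD 4870's clock taken as an assumption, not a value from the slide:
$$P_{peak} \approx 800~\text{ALUs} \times 2~\tfrac{\text{FLOP}}{\text{cycle}} \times 0.75~\text{GHz} = 1.2~\text{TFLOPS (SP)}$$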
68
3.2 Introduction to GPGPUs (4) Figure 3.2.5: Peak SP FP performance figures, Nvidia's GPUs vs Intel's CPUs [25]
69
3.2 Introduction to GPGPUs (5) Figure 3.2.6: Bandwidth figures: Nvidia’s GPUs vs Intel’s CPUs [GB/s] [25]
70
Not cached. Figure 3.2.7: Utilization of the die area in CPUs vs GPUs [25] 3.2 Introduction to GPGPUs (6)
71
Use of GPUs for HPC: based on their FP32 computing capability and the large number of execution units available, GPUs with a unified shader architecture are prospective candidates for speeding up HPC. GPUs with unified shader architectures are also termed GPGPUs (General Purpose GPUs). For HPC computations they use the SIMT (Single Instruction Multiple Threads) computation model. 3.2 Introduction to GPGPUs (7)
72
The SIMT computational model (CM)
73
Main alternatives of data parallel execution (Figure 3.2.8):
- SIMD execution: one-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors.
- SIMT execution: one/two-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input arrays (vectors/matrices).
3.2 The SIMT computational model (1)
74
Figure 3.2.9: Scope of data parallel execution vs scalar execution (at the programming level): scalar execution – domain of execution: single data elements; SIMD execution – domain of execution: elements of vectors; SIMT execution – domain of execution: elements of matrices.
Remarks: 1. SIMT execution is also termed SPMD (Single-Program Multiple-Data) execution (Nvidia). 2. At the processor level, two-dimensional domains of execution can be mapped to any set of cores (e.g. to a line of cores).
3.2 The SIMT computational model (2)
75
Main alternatives of data parallel execution (Figure 3.2.10):
- SIMD execution (e.g. 2nd- and 3rd-generation superscalars): one-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input vectors.
- SIMT execution (e.g. GPGPUs, data parallel accelerators): one/two-dimensional data parallel execution, i.e. it performs the same operation on all elements of given FX/FP input arrays (vectors/matrices); it is massively multithreaded and provides data-dependent flow control as well as barrier synchronization.
3.2 The SIMT computational model (3)
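To make the distinction concrete, here is a minimal C sketch (illustrative only, not any vendor's API): SIMD applies one operation across a vector in lock-step, while SIMT conceptually instantiates one lightweight thread of the same kernel per element of a 1D/2D domain, each with its own coordinates and possibly its own control flow.

```c
#define W 64
#define H 64

/* SIMD-style: one operation over all elements of input vectors. */
void simd_add(float c[W], const float a[W], const float b[W])
{
    for (int i = 0; i < W; i++)       /* conceptually one wide vector op */
        c[i] = a[i] + b[i];
}

/* SIMT-style: the same scalar kernel is instantiated once per element
   of a two-dimensional domain; a GPGPU would run these as threads. */
void kernel_scale(float *m, int x, int y, float s)
{
    m[y * W + x] *= s;                /* per-thread work; may branch per element */
}

void simt_launch(float m[H * W], float s)
{
    for (int y = 0; y < H; y++)       /* the loop nest stands in for the thread grid */
        for (int x = 0; x < W; x++)
            kernel_scale(m, x, y, s);
}
```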
76
Recent implementations of the SIMT CM
77
Basic implementation alternatives of the SIMT execution (Figure 3.2.12):
- GPGPUs: programmable GPUs with appropriate programming environments, e.g. Nvidia's 8800 and GTX lines, AMD's HD 38xx and HD 48xx lines; have display outputs.
- Data parallel accelerators: dedicated units supporting data parallel execution with an appropriate programming environment, e.g. Nvidia's Tesla lines, AMD's FireStream lines; no display outputs, larger memories than GPGPUs.
3.2 Recent implementations of the SIMT CM (1)
78
Figure 3.2.13: GPGPU families of Nvidia and AMD/ATI. Nvidia's line: G80 (90 nm) – shrink → G92 (65 nm) – enhanced arch. → GT200 (65 nm). AMD/ATI's line: R600 (80 nm) – shrink → RV670 (55 nm) – enhanced arch. → RV770 (55 nm). 3.2 Recent implementations of the SIMT CM (2)
79
Figure 3.2.14: Overview of GPGPUs (2005-2008).
Nvidia cores and cards:
- G80, 11/06, 90 nm/681 mtrs: 8800 GTS (96 ALUs, 320-bit), 8800 GTX (128 ALUs, 384-bit)
- G92, 10/07, 65 nm/754 mtrs: 8800 GT (112 ALUs, 256-bit)
- GT200, 6/08, 65 nm/1400 mtrs: GTX260 (192 ALUs, 448-bit), GTX280 (240 ALUs, 512-bit)
CUDA: Version 1.0 (6/07), Version 1.1 (11/07), Version 2.0 (6/08).
AMD/ATI cores and cards:
- R500, 11/05 (Xbox): 48 ALUs
- R600, 5/07, 80 nm/681 mtrs: HD 2900XT (320 ALUs, 512-bit)
- RV670, 11/07, 55 nm/666 mtrs: HD 3850 (320 ALUs, 256-bit), HD 3870 (320 ALUs, 256-bit)
- RV770, 5/08, 55 nm/956 mtrs: HD 4850 (800 ALUs, 256-bit), HD 4870 (800 ALUs, 256-bit)
Software: Brook+ (11/07), RapidMind support for the 3870 (6/08).
3.2 Recent implementations of the SIMT CM (3)
80
Implementation alternatives of data parallel accelerators (Figure 3.2.15): on-card implementation (recent implementations), e.g. GPU cards – Nvidia's Tesla and AMD/ATI's FireStream accelerator families. 3.2 Recent implementations of the SIMT CM (4)
81
Figure 3.2.16: Overview of Nvidia's Tesla family.
- Card C870 (6/07): G80-based, 1.5 GB GDDR3, 0.519 TFLOPS
- Desktop D870 (6/07): G80-based, 2x C870 included, 3 GB GDDR3, 1.037 TFLOPS
- 1U server S870 (6/07): G80-based, 4x C870 included, 6 GB GDDR3, 2.074 TFLOPS
- Card C1060 (6/08): GT200-based, 4 GB GDDR3, 0.936 TFLOPS
- 1U server S1070 (6/08): GT200-based, 4x C1060, 16 GB GDDR3, 3.744 TFLOPS
CUDA: Version 1.0 (6/07), Version 1.01 (11/07), Version 2.0 (6/08).
3.2 Recent implementations of the SIMT CM (5)
82
Figure 3.2.17: Overview of AMD/ATI's FireStream family.
- 9170 (announced 11/07, shipped 6/08): RV670-based, 2 GB GDDR3, 500 GFLOPS FP32, ~200 GFLOPS FP64
- 9250 (announced 6/08, shipped 10/08): RV770-based, 1 GB GDDR3, 1 TFLOPS FP32, ~300 GFLOPS FP64
Stream Computing SDK Version 1.0 (12/07): Brook+, ACML (AMD Core Math Library), CAL (Compute Abstraction Layer); RapidMind.
3.2 Recent implementations of the SIMT CM (6)
83
Implementation alternatives of data parallel accelerators (Figure 3.2.15, extended):
- On-card implementation (recent implementations), e.g. GPU cards – Nvidia's Tesla and AMD/ATI's FireStream accelerator families.
- On-die integration (future implementations, the trend), e.g. Intel's Havendale, AMD's Fusion integration technology.
3.2 Recent implementations of the SIMT CM (4)
84
3.2 Recent implementations of the SIMT CM (7) Figure 3.2.18: Expected evolution of attached GPGPUs [42] Integration to the chip
85
Intel’s future processors with attached architecture
86
3.2 Intel's future processors with attached architecture (1) Figure 3.2.19: Intel's desktop roadmap [26] (Pentium 4 → Core 2 → Core i7 (Nehalem), Q4/08)
87
3.2 Intel's future processors with attached architecture (2) Figure 3.2.20: A part of Intel's desktop roadmap [26] (45 nm; Q4/08, Q1/09, Q2/09, Q3/09)
88
AMD’s future processors with attached architecture
89
3.2 AMD’s future processors with attached architecture (1) Figure 3.2.21: AMD’s view about the major phases of processor evolution [27]
90
6/2006 The Torrenza initiative (2006 Technology Analyst Day) Platform level integration of accelerators in AMD’s multi-socket systems via cache coherent HyperTransport systems [40]. 3.2 AMD’s future processors with attached architecture (4)
91
Figure 3.2.22: Introduction of the Torrenza platform level integration technique [40] (cache coherent HT) 3.2 AMD’s future processors with attached architecture (3)
92
6/2006: the Torrenza initiative (2006 Technology Analyst Day) – platform-level integration of accelerators into AMD's multi-socket systems via cache-coherent HyperTransport links [40].
10/2006: acquisition of ATI.
10/2006: the Fusion initiative – silicon-level integration of accelerators into AMD processors (first Fusion processors originally due at the end of 2008 / early 2009) [41].
3/2007: "integration" of the Torrenza and the Fusion initiatives into a continuum of accelerated computing solutions.
3.2 AMD's future processors with attached architecture (4)
93
Figure 3.2.23: The Torrenza platform and the Fusion integration technology as a continuum for accelerated computing solutions [29] Remark: It is based on an earlier Alienware presentation from 6/2006 [38]. 3.2 AMD’s future processors with attached architecture (5)
94
Implementation of Fusion processors 2007/2008 AMD made a number of confusing announcements and withdrawals [31] – [35]. According to the latest announcements (11/2008) AMD plans to introduce 32 nm Fusion processors only in 2011 [37]. 3.2 AMD’s future processors with attached architecture (6)
95
Figure 3.2.24: AMD's 2008 roadmap for client processors [37] 3.2 AMD's future processors with attached architecture (7)
96
4. Outlook
97
4. Outlook (1) – The future of heterogeneous multicores. Figure 4.1: Expected evolution of heterogeneous multicore processors.
- Master/slave architectures: 1(Ma):M(S) → 2(Ma):M(S) → M(Ma):M(S)
- Add-on architectures: 1(CPU):1(D) → M(CPU):1(D) → M(CPU):M(D)
(Ma: master, S: slave, D: dedicated unit (like a GPU), M: many, H: homogeneous.)
Open question: do the end points converge, i.e. M(Ma) = M(CPU) and M(S) = M(D)?
98
4. Outlook (2) Heterogeneous multicores: M(CPU):M(D)
99
The future of homogeneous multicores Larrabee Tiled processor In fact: both are of the same type: M(CPU):M(D) 4. Outlook (3) Figure 4.2: Simplified block diagrams of Larrabee and the Tiled processor [4], [7]
100
4. Outlook (4) The main road of processor evolution M(CPU):M(D)
101
Thank you for your attention!
102
5. References
[1]: Bhandarkar D., "The Dawn of a New Era," 11. EMEA, May 2006, Budapest
[2]: Wall D. W., "Limits of ILP," WRL TN-15, Dec. 1990, DEC, http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-TN-15.html
[3]: Loktu A., "Itanium 2 for Enterprise Computing," http://h40132.www4.hp.com/upload/se/sv/Itanium2forenterprisecomputing.pps
[4]: Stokes J., "Larrabee: Intel's biggest leap ahead since the Pentium Pro," Ars Technica, Aug. 4, 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-intels-biggest-leap-ahead-since-the-pentium-pro.html
[5]: Shimpi A. L. & Wilson D., "Intel's Larrabee Architecture Disclosure: A Calculated First Move," Anandtech, Aug. 4, 2008, http://www.anandtech.com/showdoc.aspx?i=3367&p=2
[6]: Timm J.-F., "Larrabee: Fakten zur Intel Highend-Grafikkarte," ComputerBase, June 2, 2007, http://www.computerbase.de/news/hardware/grafikkarten/2007/juni/larrabee_fakten_intel_highend-grafikkarte/
[7]: Shrout R., "Intel's 80 Core Terascale Chip Explored: 4 GHz Clocks and more," PC Perspective, Feb. 11, 2007, http://www.pcper.com/article.php?aid=363
[8]: Goto H., "Intel's Manycore CPUs," PC Watch, June 11, 2007, http://pc.watch.impress.co.jp/docs/2007/0611/kaigai364.htm
103
[9]: Hoskote Y. et al., "A 5-GHz Mesh Interconnect for a Teraflops Processor," IEEE Micro, Vol. 27, No. 5, Sept./Oct. 2007, pp. 51-61
[10]: Taylor M. et al., "The Raw Processor," Hot Chips, Aug. 13, 2001, http://www.hotchips.org/archives/hc13/3_Tue/22mit.pdf
[11]: Goto H., "Larrabee architecture can be integrated into CPU," PC Watch, Oct. 6, 2008, http://pc.watch.impress.co.jp/docs/2008/1006/kaigai470.htm
[12]: Stokes J., "Larrabee: Intel's biggest leap since the Pentium Pro," Ars Technica, Aug. 4, 2008, http://arstechnica.com/news.ars/post/20080804-larrabee-intels-biggest-leap-ahead-since-the-pentium-pro.html
[13]: Singhal R., "Next Generation Intel Microarchitecture (Nehalem) Family: Architecture Insight and Power Management," IDF Taipei, Oct. 2008, http://intel.wingateweb.com/taiwan08/published/sessions/TPTS001/FA08%20IDF-Taipei_TPTS001_100.pdf
[14]: AMD Opteron Processor for Servers and Workstations, http://amd.com.cn/CHCN/Processors/ProductInformation/0,,30_118_8826_8832,00-1.html
[15]: AMD Opteron Processor with Direct Connect Architecture, 2P Server Power Savings Comparison, AMD, http://enterprise.amd.com/downloads/2P_Power_PID_41497.pdf
[16]: AMD Opteron Processor with Direct Connect Architecture, 4P Server Power Savings Comparison, AMD, http://enterprise.amd.com/downloads/4P_Power_PID_41498.pdf
104
[17]: Kanter D., "Inside Barcelona: AMD's Next Generation," Real World Tech, May 16, 2007, http://www.realworldtech.com/page.cfm?ArticleID=RWT051607033728
[18]: Kanter D., "AMD's K8L and 4x4 Preview," Real World Tech, June 2, 2006, http://www.realworldtech.com/page.cfm?ArticleID=RWT060206035626&p=1
[19]: Enderle R., "AMD Shanghai: We are back!," TG Daily, Nov. 13, 2008, http://www.tgdaily.com/content/view/40176/128/
[20]: Gschwind M., "Chip Multiprocessing and the Cell BE," ACM Computing Frontiers, 2006, http://beatys1.mscd.edu/compfront//2006/cf06-gschwind.pdf
[21]: Wright C., Henning P., Bergen B., "Roadrunner Tutorial – An Introduction to Roadrunner and the Cell Processor," Feb. 7, 2008, http://www.lanl.gov/orgs/hpc/roadrunner/pdfs/Roadrunner-tutorial-session-1-web1.pdf
[22]: Hofstee H. P., "Industry Trends in Microprocessor Design," IBM, Oct. 4, 2007, http://lanl.gov/orgs/hpc/roadrunner/rrinfo/RR%20webPDFs/Cell_Hofstee_Non_Conf.pdf
[23]: Nvidia GeForce 8800 GPU Architecture Overview, Vers. 0.1, Nov. 2006, Nvidia, http://www.nvidia.com/page/8800_tech_briefs.html
[24]: Houston M., "Anatomy of AMD's TeraScale Graphics Engine," SIGGRAPH 2008, http://s08.idav.ucdavis.edu/houston-amd-terascale.pdf
[25]: Nvidia CUDA Compute Unified Device Architecture Programming Guide, Version 2.0, June 2008, Nvidia
105
[26]: Goto H., "Intel Desktop CPU Roadmap," 2008, http://pc.watch.impress.co.jp/docs/2008/0326/kaigai02.pdf
[27]: The Industry-Changing Impact of Accelerated Computing – Fusion White Paper, AMD, 2008, http://www.amd.com/us/Documents/AMD_fusion_Whitepaper.pdf
[28]: AMD Announces Initiatives To Elevate AMD64 As Platform For System- And Industry-Wide Innovation, AMD, June 1, 2006, http://www.amd.com/us-en/Weblets/0,,7832_8366_5730~109409,00.html
[29]: Metal G., "AMD Torrenza and Fusion together," Metalghost, March 22, 2007, http://www.metalghost.ro/index.php?view=article&catid=30%3Ahardware&id=233%3Aamd-torrenza-and-fusion-together&option=com_content
[30]: Hester P., "Multi-Core and Beyond: Evolving the x86 Architecture," Hot Chips 19, Aug. 2007, http://www.hotchips.org/hc19/docs/keynote2.pdf
[31]: Hester P., 2007 Technology Analyst Day, AMD, July 26, 2007, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/July_2007_AMD_Analyst_Day_Phil_Hester-Bob_Drebin.pdf
[32]: Rivas M., 2007 Financial Analyst Day, AMD, Dec. 13, 2007, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/July_2007_AMD_Analyst_Day_Phil_Hester-Bob_Drebin.pdf
[33]: Smalley T., "Shrike is AMD's First Fusion Platform," Trusted Reviews, June 9, 2008, http://www.trustedreviews.com/notebooks/news/2008/06/09/Shrike-Is-AMDs-First-Fusion-Platform/p1
106
[34]: Hruska J., "AMD Fusion now pushed back to 2011," Ars Technica, Nov. 14, 2008, http://arstechnica.com/news.ars/post/20081114-amd-fusion-now-pushed-back-to-2011.html
[35]: Gruener W., "AMD delays Fusion processor to 2011," TG Daily, Nov. 13, 2008, http://www.tgdaily.com/content/view/40186/135
[36]: Wilson D., "AMD Analyst Day Platform Announcements," Anandtech, June 2, 2006, http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2768&p=2
[37]: Allen R., Financial Analyst Day, AMD, Nov. 13, 2008, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/RandyAllenAMD2008AnalystDay11-13-2008.pdf
[38]: Gonzales N., 2006 Technology Analyst Day, Alienware, June 1, 2006, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/PhilHesterAMDAnalystDayV2.pdf
[39]: Hester P., 2006 Technology Analyst Day, AMD, June 1, 2006, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/PhilHesterAMDAnalystDayV2.pdf
[40]: Seyer M., 2006 Technology Analyst Day, AMD, June 1, 2006, http://www.amd.com/us-en/assets/content_type/DownloadableAssets/MartySeyerAMDAnalystWebv3.pdf
[41]: AMD Completes ATI Acquisition and Creates Processing Powerhouse, Oct. 25, 2006, http://www.amd.com/us-en/Corporate/VirtualPressRoom/0,,51_104_543~113741,00.html
107
[42]: Stokes J., “A closer look at AMD’s CPU/GPU Fusion,” Ars Technica, Nov. 19. 2006, http://arstechnica.com/news.ars/post/20061119-8250.html
108
Figure: AMD's Tick-Tock model and the related Opteron MP servers, alternating TICK (shrink) and TOCK (new microarchitecture) steps across 130 nm, 90 nm, 65 nm and 45 nm: 840-850 (Sledgehammer, 130 nm) and 842-856 (Athens, 90 nm) with 1x1 C, 1 MB L2; 865-890 (Egypt) and 82xx (Santa Rosa) with 1x2 C, 2 MB L2; 8347-56 (Barcelona, 65 nm) with 1x4 C, ½ MB L2/C plus 2 MB L3; 8378-84 (Shanghai, 45 nm) with 1x4 C, ½ MB L2/C plus a larger L3.
109
Figure: Larrabee’s Software stack [12]
110
Figure: Layout of MIT’s Raw Processor [10]
111
3.1 The Cell BE (1) – the Cell BE is a joint development of Sony, IBM and Toshiba. Summer 2000: basic decisions concerning the architecture. Aim: games and multimedia, in addition HPC.
Milestones: 02/2006 Cell Blade QS20; 11/2006 PlayStation 3 (PS3); 08/2007 Cell Blade QS21; 05/2008 Cell Blade QS22 (the QS2x Blade Server family).
Rumors (9/2008): PlayStation 4 in 2011? (competition: Xbox 3) – 2x PS3 performance, 12 cores/45 nm, GDDR3/DDR3 (instead of XDR).
112
EIB: Element Interface Bus; SPE: Synergistic Processing Element; SPU: Synergistic Processor Unit; SXU: Synergistic Execution Unit; LS: Local Store of 256 KB; SMF: Synergistic Memory Flow unit; PPE: Power Processing Element; PPU: Power Processing Unit; PXU: POWER Execution Unit; MIC: Memory Interface Controller; BIC: Bus Interface Controller; XDR: Rambus DRAM. Figure 3.1.1: Block diagram of the Cell BE [20]. 3.1 The Cell BE (3)
113
Figure: Layout of the EIB [21] 3.1 Master/slave multicore processors – The Cell (4)
114
Figure: Concurrent data transfers over the EIB [21] 3.1 Master/slave multicore processors – The Cell (5)
115
Massive multithreading: multithreading is implemented by creating and managing parallel executable threads for each data element of the execution domain, with the same instructions for all data elements. Figure: Threads allocated to the elements of an execution domain. 3.2 Attached multicore processors (10)
116
Figure 3.2.11: Per-thread contexts needed per ALU for fast context switching (a SIMT core: fetch/decode feeding the ALUs, each ALU with a register file (RF) holding the thread contexts (CTX); a context switch selects the actual context). 3.2 The SIMT computational model (4)
117
Alienware's early vision of the integration of CPUs and GPUs (6/2006). Figure: Early vision of the integration of CPUs and GPUs, presented by Alienware (a performance PC maker) [38].
118
Figure: AMD’s view about the evolution of mainstream computing [30] 5.2 AMD/ATI’s GPGPU line (1)
119
AMD's plans to implement Fusion-class processors: the 32 nm Falcon processor with the Bulldozer CPU core (7/2007: Technology Analyst Day). Figure: AMD's planned 32 nm mobile/mainstream Falcon Fusion family [31] (32 nm brand-new core; UVD: Unified Video Decoder).
120
The 45 nm Swift processor family (12/2007: Financial Analyst Day) Figure: AMD’s planned 45 nm Swift Fusion processor family [32] (K10)
121
Figure: AMD’s planned 45 nm Shrike mobile platform with the Swift processor [33] The 45 nm Shrike platform with the Swift processor (6/2008)
122
Nov. 2008 (Financial Analyst Day): AMD cancelled both the 45 nm Shrike platform and the Swift processor [34], [35]. Reason: the 45 nm implementation would have brought only modest improvements in performance, power and cost. Recent plan: the planned CPU-GPU integration awaits 32 nm technology and is due in 2011.
123
124
Large-Scale Systems Modeling: Networks of QS2x Blades Peter Altevogt, Tibor Kiss IBM STG Boeblingen Wolfgang Denzel IBM Research Zurich Miklos Kozlovszky Budapest Tech
125
Research objectives: provide a simulation infrastructure for
- detailed modification analysis of IO subsystems, networks and workloads;
- limited modification analysis of processor cores: as workload generators they are treated as black (grey) boxes; workload characterization is based on low-level processor core simulations or measurements.
Subtasks: high-level simulation design of networks of QS2x blades; system representation; workload representation; implementation.
126
Modeled Components Workload as generated by the processor cores System components: processor cores* as workload generators for executing computational delays memory and IO subsystems bus interfaces, southbridges, network adapter network switches, router,... * without bus interfaces
127
Network General Setup Blades... : requests
128
High-Level Simulation Design – Blade system: hardware view. Modeled components: processor cores (Cores0, Cores1), buses (EIB0, EIB1), memories (mem0, mem1), southbridges (SB0, SB1) and the network adapter (to/from the network). Processor cores: generate requests against the IO subsystem / network and execute computational requests in the form of delays.
129
High-Level Simulation Design (2): Blade system: detailed simulation view Processor cores (2 chips in case of Blades) netw mem1 EIB1 SB1 mem0 EIB0 SB0 IO subsystem Adaptive workload generator network Workload generator: – generating requests against IO subsystem / network Processor cores: – executing computational requests in form of delays
130
Figure: Overview of the implementation of Intel's Tick-Tock model for MP servers [24] (repeat of Figure 2.1.2 with the same processor data). 2. Intel's MP servers (5)
131
2.1 – Intel's multicore MP server processors (2). Figure 2.2: Evolution of Intel's MP server chipsets: the preceding NB with Potomac (SC, DDR/DDR2); in 2005/2006 the 8500 (Twin Castle) with Paxville MP / Tulsa (DC), XMBs and DDR/DDR2; in 2007 the 7300 (Clarksboro) with Tigerton (DC/QC), per-socket FSBs and FBDIMM/DDR2.
132
2.1 – Intel's multicore MP server processors (3). Figure 2.3: Four-socket 7300 (Caneland) motherboard (Supermicro X7QC3): Xeon 7200 (DC) / 7300 (QC, Tigerton) processors, the 7300 NB with FB-DIMM (DDR2) memory up to 192 GB, and the SBE2 SB.
133
2.1 – AMD's multicore MP server processors (1). Figure 2.4: Basic structure of the Opteron family (UP: Opteron 100/1000, DP: Opteron 200/2000, MP: Opteron 800/8000).
134
2.1 – AMD's multicore MP server processors (2). Figure 2.5: AMD's 4P/8P Direct Connect server architecture.
135
2.1 – AMD's multicore MP server processors (3). Figure 2.6: System architecture of Intel's Nehalem processor family (Nov. 17, 2008), with on-die memory controller.
136
Unique features of the Cell BE – Overview of the Cell BE (4).
a) It is a heterogeneous MCP rather than a symmetrical MCP (as in usual implementations). Contrasting the PPE and the SPEs:
- The PPE complies with the 64-bit PowerPC ISA, is optimized to run a 32/64-bit OS and usually controls the SPEs; it is more adept at control-intensive tasks and quicker at task switching.
- The SPEs are optimized to run compute-intensive SIMD applications, usually operate under the control of the PPE, run their individual applications (threads), have full access to a coherent shared memory including the memory-mapped I/O space, and can be programmed in C/C++; they are more adept at compute-intensive tasks and slower at task switching.
137
Overview of the Cell BE (5).
b) The SPEs have an unusual storage architecture: each SPE operates in connection with a local store (LS) of 256 KB, i.e. it fetches instructions from its private LS and its load/store instructions access the LS rather than the main store; the LS has no associated cache. SPEs access main memory (the effective address space) by DMA commands, i.e. DMA commands move data and instructions between the main store and the private LS; DMA commands can be batched (up to 16 commands).
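A sketch of the resulting SPE-side pattern: stage a block into the 256 KB local store by DMA, compute on it, and DMA the result back. The dma_get/dma_put/dma_wait calls below are hypothetical stand-ins for the MFC DMA commands of the Cell SDK, not the real API:

```c
#include <stdint.h>

#define LS_BLOCK 4096   /* bytes staged into the local store per step */

/* Hypothetical DMA primitives standing in for the SPE's MFC commands. */
extern void dma_get(void *ls, uint64_t ea, uint32_t size, int tag);
extern void dma_put(const void *ls, uint64_t ea, uint32_t size, int tag);
extern void dma_wait(int tag);

static float ls_buf[LS_BLOCK / sizeof(float)];  /* lives in the 256 KB LS */

void spe_process(uint64_t ea_in, uint64_t ea_out, uint32_t nbytes)
{
    for (uint32_t off = 0; off < nbytes; off += LS_BLOCK) {
        uint32_t len = (nbytes - off < LS_BLOCK) ? nbytes - off : LS_BLOCK;

        dma_get(ls_buf, ea_in + off, len, 0);   /* main store -> LS */
        dma_wait(0);

        for (uint32_t i = 0; i < len / sizeof(float); i++)
            ls_buf[i] *= 2.0f;                  /* compute purely out of the LS */

        dma_put(ls_buf, ea_out + off, len, 0);  /* LS -> main store */
        dma_wait(0);
    }
}
```

In practice one would double-buffer, i.e. issue the dma_get for the next block before computing on the current one, so the DMA latency is hidden behind the computation.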
138
Figure: Die shot and floorplan of the Cell BE (221 mm², 234 M transistors) [15]. 3.1 Master/slave multicore processors – The Cell (3)
139
4. Outlook (1) Intel's Nehalem (i7) family (Nov. 17, 2008). Main features: integrated memory controller, 4/6/8 cores, dual-threaded cores, FSB replaced by a serial bus (QuickPath Interconnect).
- Bloomfield (45 nm, desktop): 4 cores, triple-channel DDR3
- Beckton (45 nm, MP server): 8 cores, quad-channel FB-DIMM
- Westmere (32 nm, desktop / DP server): 4/6 cores, triple-channel DDR3 / quad-channel DDR3
142
http://pc.watch.impress.co.jp/docs/2007/0122/kaigai330.htm
144
http://translate.google.com/translate?hl=en&sl=ja&u=http://pc.watch.impress.co.jp/docs/2007/0131/kaigai332.htm&sa=X&oi=translate&resnum=3&ct=result&prev=/search%3Fq%3Damd%2Bfusion%2Bpcwatch%26hl%3Den%26sa%3DG
146
HTX slots will be standard interfaces connected directly to an AMD CPU's HyperTransport link. If both of these links are coherent, the device and the CPU will be able to communicate directly with each other with cache coherency. Because of this, latency can be reduced greatly over other buses as well, enabling hardware vendors to begin to create true coprocessor technology once again. http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2768&p=2
147
http://www.amd.com/us/Documents/AMD_fusion_Whitepaper.pdf – Fusion was announced in Oct. 2006, originally due in 1H 2008.
148
http://www.amd.com/us-en/assets/content_type/DownloadableAssets/July_2007_AMD_Analyst_Day_Phil_Hester-Bob_Drebin.pdf – 32 nm brand-new core
149
http://download.amd.com/Corporate/MarioRivasDec2007AMDAnalystDay.pdf
150
Fusion constraints: die size, dissipation, memory bandwidth. Phil Hester: Fusion will never go to the high end, due to dissipation. AMD's die sizes: high-end desktop CPUs ~200 mm², mainstream CPUs 120-150 mm², value CPUs ~100 mm² or less; high-end GPUs >300 mm², midrange GPUs 120-150 mm², value GPUs ~100 mm² or less. Since roughly half of a Fusion die can be spared for the GPU core, the size of the integratable GPU core is constrained accordingly: a 45 nm-generation Fusion can integrate a GPU core of about the size and rank of a 65 nm-generation low-end discrete GPU. Memory: the CPU uses commodity DRAM while the GPU uses graphics DRAM (GDDR3/4/5); they differ in memory size, bandwidth and data path width (8 B vs 32/64 B). Hence the coexistence of Torrenza and Fusion (high end: Torrenza). http://pc.watch.impress.co.jp/docs/2007/0131/kaigai332.htm
151
152
http://arstechnica.com/news.ars/post/20081114-amd-fusion-now-pushed-back-to-2011.html Nov. 14, 2008
153
http://www.techpowerup.com/reviews/AMD/Analysts_Day
154
PC Watch (07.01.31), http://pc.watch.impress.co.jp/docs/2007/0131/kaigai332.htm: roughly 1 GB/s of memory bandwidth is needed per 10 GFLOPS of compute.
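Applying that rule of thumb to a hypothetical 1 TFLOPS integrated GPU (the FLOPS figure is only an example, not a claim about any specific Fusion part):
$$BW \approx 1000~\text{GFLOPS} \times \frac{1~\text{GB/s}}{10~\text{GFLOPS}} = 100~\text{GB/s}$$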
155
The 45 nm Fusion processor, initially promised as a 2009 chip and then moved into 2010 is essentially cancelled. The chip, which was described to combine a CPU and GPU under one hood in the “Shrike” core, was found to only bring modest improvements over today’s platforms in terms of power efficiency, cost and performance. Instead, the company will introduce Fusion (which actually isn’t called Fusion anymore) as a 2011 model in a 32 nm version with Llano core. Allen said that 32 nm would be the right technology to introduce the product. Llano will feature four cores, 4 MB of cache, DDR3 memory support and an integrated GPU. http://www.tgdaily.com/content/view/40186/135/ Nov 13 2008
156
Possible use of surplus transistors (doubling transistor counts ~every two years, Moore's rule): wider processor width (pipelined → 1st-gen. superscalar (2-wide) → 2nd-gen. superscalar (4-wide)), core enhancements (branch prediction, speculative loads, ...), cache enhancements (L2/L3 size, associativity, ...). 1. The inevitability of multicore processors (3)
157
Figure: Overview of Intel's Tick-Tock model and the related MP servers [24] (repeat of Figure 2.1.2: P4-based 90 nm TICK Pentium 4/Prescott, TOCK Pentium 4/Irwindale; Potomac, 7000 Paxville MP, Cransfield, 7100 Tulsa, 7200/7300 Tigerton, 7400 Dunnington, 7xxx Beckton; 3/2005: first 64-bit MP Xeons, 11/2005: first DC MP Xeon, Beckton due 1Q/2009).