Published by Daniele Gianluigi Giuliano. Modified over 5 years ago.
1
Group Talk
Charlie Brej, APT Group, University of Manchester
2
I Computers
3
Never Obsolete?
4
Want
5
Cool Forever
6
This Talk
History of microprocessors
  What did they do and why did they do it
  Nostalgia
The 1000 chicken computer
  What’s wrong with the idea
  Benchmarks of Artois
Current challenges
Future designs
  New micro architectures
  Where are we heading?
88 slides
7
Single-Threaded Performance (MIPS)
[Chart labels: 4004, 8080, 8086, i286, i386, i486, Pentium, Pentium MMX, Pentium Pro, Pentium II, Pentium III, Pentium 4, Core 2, Nehalem, Sandy Bridge]
8
Chapter 1: Make it possible
9
4004: Computer on a chip!
10
4004
Year: 1971
2,300 transistors
46 instructions
16 registers of 4 bits each
4 KB memory access
11
1974: Memory was expensive * $6, today’s money
12
8080
Year: 1974
78 instructions
7 registers, 8-bit each
Stack operations
64 KB memory access
13
8086
Year: 1978
8 registers, 16-bit each
1MB memory access (segmented)
Many instructions but multi cycle:
  JMP 15 cycles
  MUL cycles
Jim could do better in software but using up RAM
14
Chapter 2: Make it useful
15
i286
Year: 1982
Protected mode
Linear MMU (Memory Protection Unit)
‘Brain dead chip’ (Bill Gates)
16
i386
Year: 1986
32-bit architecture (we still use today)
Virtual memory MMU
Performance becomes important
Real MUL/DIV, on-chip floating point
17
Chapter 3: Decrease CPI (Clocks Per Instruction)
19
i486
Year: 1989
Pipelining (5 stage)
Single cycle instructions
~4 times faster than an i386
Clock multiplier 2x and 3x
8 KB on-chip cache
20
Chapter 4: Increase IPC (Instructions Per Clock)
21
Pentium (P54C, P54CS and MMX)
Year: 1993
Superscalar (2 way)
MMX (basic SIMD ops)
6 stage pipeline
22
Instructions Per Clock
Pentium MMX with double L1 cache Clock multiplied Pentium
23
Chapter 5: Improve Caches
24
Pentium Pro
Year: 1995
Speculative OOO execution
Ten stage pipeline
On-package L2 cache
25
Pentium Pro Versions 256KB 1024KB 512KB
26
Pentium II
Year: 1997
Off-chip L2:
  PII: 512KB half speed
  Xeon: 2MB full speed
27
Pentium III (Katmai) Year:1999 (Aug) Short life
28
Pentium III (Coppermine)
Year: 1999 (Oct)
On chip 256K L2
Or 2MB for servers
29
Pentium III (Tualatin)
Year:2001
30
Pentium M (aka Core) Year:2004 Cache transistors: 66%
31
P6 Range stats
Processors (in column order): Pentium Pro, Pentium II, Pentium III (Katmai), Pentium III (Coppermine), Pentium III (Tualatin), Pentium M (Core)
Released: 11/1995, 04/1998, 02/1999, 10/1999, 07/2001, 04/2004
Tech (some columns share a value): 350nm, 250nm, 180nm, 130nm, 90nm
Transistors: 5.5M, 7.5M, 9.5M, 28.1M, 44M, 140M
L1 Cache (some columns share a value): 16KB, 32KB, 64KB
L2 Cache (some columns share a value): 256KB, 512KB, 2048KB
Max. Clock: 200MHz, 450MHz, 600MHz, 1000MHz, 1400MHz, 2333MHz
FSB speed (some columns share a value): 66MHz, 100MHz, 133MHz, 667MHz*
Clock Mul: 3x, 4.5x, 6x, 7.5x, 10.5x, 13x
Cache transistors: 18%, 27%, 21%, 46%, 55%, 66%
32
Cache transistors (%)
33
Chapter 6: Increase Clock Speed
34
Pentium 4 (Willamette) Year:2000
Super-pipelined: 31 stage pipeline (!)
35
Pentium 4 (Northwood) Year:2002 Hyper-Threading
36
Pentium 4 (Prescott) Year:2004 Never reached its potential
37
Gate Delays per Clock
Pentium: 40
Pentium III onward: 15.2
  PIII (Katmai) and P4 (Willamette)
  Insufficient time for tech ramp
Northwood: 6.5
Prescott: 8.7 (too hot)
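As a rough sketch of what the gate-delay figures on this slide imply: on a fixed process, the clock period is roughly proportional to the number of gate delays per clock, so the relative clock rate is the inverse ratio. The function name and the fixed-process, linear-scaling assumption are illustrative and not from the talk.

```python
# Gate delays per clock, as quoted on the slide.
GATE_DELAYS = {
    "Pentium": 40.0,
    "Pentium III onward": 15.2,
    "Northwood": 6.5,
    "Prescott": 8.7,
}


def relative_clock(design: str, baseline: str = "Pentium") -> float:
    """Clock-rate ratio of `design` to `baseline` on the same process.

    Assumes the clock period scales linearly with gate delays per clock.
    """
    return GATE_DELAYS[baseline] / GATE_DELAYS[design]


if __name__ == "__main__":
    for name in GATE_DELAYS:
        print(f"{name}: {relative_clock(name):.2f}x the Pentium clock")
```

Under these assumptions a shorter pipeline stage buys clock rate only; it says nothing about how much work each clock does.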
38
Power consumption (Watts)
40
Chapter 7: Single-Chip Multiprocessors
41
Core2 (Conroe) Year:2006 Dual Core as standard Quad Core (dual die)
42
Core2 (Wolfdale) Year:2008 45nm shrink 3MB L2 cache per core
43
What next?
44
Single-Threaded Performance (MIPS)
45
Technology (nm)
46
Single-Threaded Performance (OPs/Gate delay)
Doubling every 3 years Pentium 1 is 10x slower than a Core2 if constructed on same tech (over 200x on respective techs)
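The two figures on this slide can be separated into an architecture part and a process part. A minimal sketch, assuming the two contributions simply multiply and treating "over 200x" as exactly 200 for illustration (the function name is hypothetical):

```python
def technology_factor(total_speedup: float, same_tech_speedup: float) -> float:
    """Share of the speedup attributable to the process technology,
    assuming architectural and process contributions multiply."""
    if same_tech_speedup <= 0:
        raise ValueError("same_tech_speedup must be positive")
    return total_speedup / same_tech_speedup


# Slide figures: 10x on the same tech, over 200x on respective techs.
print(technology_factor(200.0, 10.0))
```

On that reading, the process accounts for at least a factor of 20 and the micro-architecture for the remaining factor of 10.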
47
Single-Threaded Performance (offset from doubling every 3 years)
48
Transistors (millions)
Pentium 1 is 250x smaller than a Core2 (Quad)
49
Are you ready to order? Price:£250 Price:£500 Price:£1000 Price:£2000
60% 40%
50
Are you ready to order? Price:£4000 Price:£2000
51
Which would you like?
52
I would like 1000 chicken fajitas please
Certainly sir. Would you like any drinks with that?
What drinks?
53
Memory Bandwidth Problems (Guesstimates based on SPEC2006 on a 2GHz Core2)
[Diagram: four CPUs, each with a 16KB L1, sharing a 1MB L2 in front of 2GB of RAM. Values shown: 40GB/s and 20GB/s; 10GB/s and 1GB/s at each L1; L1 hit rates of 96%, 97% and 98%; 0.2GB/s, 0.3GB/s and 0.4GB/s links; 90% hit rate at the L2; RAM at 6GB/s sequenced and 1GB/s random.]
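The 10GB/s L1 figure and the 96%, 97% and 98% hit rates on this slide appear consistent with a simple relation: traffic passed to the next level is the demand bandwidth times the miss rate. A minimal sketch under that assumption, ignoring prefetching and write-backs (the function name is hypothetical):

```python
def miss_traffic(demand_gb_s: float, hit_rate: float) -> float:
    """Bandwidth forwarded to the next level of the hierarchy, in GB/s."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return demand_gb_s * (1.0 - hit_rate)


if __name__ == "__main__":
    for hit_rate in (0.96, 0.97, 0.98):
        traffic = miss_traffic(10.0, hit_rate)
        print(f"{hit_rate:.0%} hit rate -> {traffic:.1f} GB/s to L2")
```

The same relation shows why a shared cache hurts: each extra core adds its own miss traffic to the one L2, and a lower hit rate there multiplies the load on RAM.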
54
Intel White paper: Supra-linear Packet Processing Performance with Intel® Multi-core Processors
Experiment
  Snort intrusion detection app monitoring 175 TCP connections
  Memory intensive parts
Single execution core
  Throughput: 566 MB/s
  L2 Cache hit rate: 99%
Four execution cores
  Throughput: 543 MB/s
  L2 Cache hit rate: 86%
Solution: Run only one memory intensive thread at a time
55
Memory Bandwidth Solution (Guesstimates based on SPEC2006 on a 2GHz Core2)
[Diagram: the same four-CPU hierarchy with 4MB L2 caches shown alongside the 1MB L2. Values shown: 40GB/s and 20GB/s; 10GB/s and 1GB/s at each L1; 0.2GB/s and 0.4GB/s links; hit rates of 90%, 96% and 98%; RAM 2GB at 6GB/s sequenced and 1GB/s random.]
56
Core2 L2 Cache variants
Dual Core
  Celeron Dual Core: 512KB
  Pentium Dual Core: 1MB
  Core2 Duo (Allendale): 2MB
  Core2 Duo (Wolfdale-3M): 3MB
  Core2 Duo (Conroe): 4MB
  Core2 Duo (Wolfdale): 6MB
Quad Core
  Core2 (Yorkfield-6M): 2x 3MB
  Core2 (Kentsfield): 2x 4MB
  Core2 (Yorkfield): 2x 6MB
57
How much cache? 2 core: minimum 0.5MB/core 4 core: minimum 1.5MB/core
Extrapolate to 1000 cores
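To see how quickly such an extrapolation runs away, here is a sketch that fits the two data points on this slide with a straight line and with a power law and evaluates both at 1000 cores. Both model choices are assumptions made for illustration; the talk does not say which fit it used.

```python
import math

# (cores, minimum MB of cache per core), as given on the slide.
POINTS = [(2, 0.5), (4, 1.5)]


def linear_fit(cores: int) -> float:
    """Per-core cache (MB) from a straight line through the two points."""
    (x0, y0), (x1, y1) = POINTS
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (cores - x0)


def power_law_fit(cores: int) -> float:
    """Per-core cache (MB) from y = a * cores**b through the two points."""
    (x0, y0), (x1, y1) = POINTS
    exponent = math.log(y1 / y0) / math.log(x1 / x0)
    return y0 * (cores / x0) ** exponent


if __name__ == "__main__":
    for fit in (linear_fit, power_law_fit):
        per_core = fit(1000)
        print(f"{fit.__name__}: {per_core:.1f} MB/core, {per_core * 1000:.0f} MB total")
```

Two points cannot distinguish between the models, and either one yields a cache far larger than any chip of the period could hold.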
58
Nonsensical Extrapolations
26MB = 99.9% hit rate?
59
What would you like with your drinks?
60
Performance per Transistor (operations per gate delay per million transistors)
61
UltraSPARC T2 (Niagara)
8 cores/chip
8 threads/core (64 threads)
2 ALUs per core
  One for each group of 4 threads
  Up to 16 instructions/cycle
1.2 and 1.4 GHz
4MB L2 cache
Four dual-channel FB-DIMM controllers
62
Computation Heavy Benchmark
63
PBzip (somewhat memory intensive)
64
Comparison
Sun SPARC Enterprise T5240
  2 x UltraSPARC T2 1.2GHz
  SPECint_rate2006 (127 threads) Peak: 134, Base: 115
Equivalent
  2 x Quad Core Xeon® X GHz
  SPECint_rate2006 (8 threads) Peak: 138, Base: 114
65
Price for SPECint_rate 134
Charlie Special: 2 x Quad Core Xeon X GHz + 32GB RAM, $3,718.00
Dell PowerEdge 2900 III: $6,546.00
Apple Mac Pro: 2 x 3.2GHz Quad-Core Intel Xeon + 32GB RAM, $13,499.00
Sun SPARC Enterprise T5240 Server: 2 x UltraSPARC T2 1.2 GHz + 32GB RAM, $31,995.00
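The price list on this slide can be turned into a cost per benchmark point. A small sketch, assuming (as the slide title suggests) that every system is credited with the same SPECint_rate of 134; the dictionary keys are shortened labels and the function names are hypothetical.

```python
# Prices in US dollars, as quoted on the slide.
PRICES_USD = {
    "Charlie Special": 3718.00,
    "Dell PowerEdge 2900 III": 6546.00,
    "Apple Mac Pro": 13499.00,
    "Sun SPARC Enterprise T5240": 31995.00,
}
TARGET_SCORE = 134  # assumed identical for every system


def dollars_per_point(price_usd: float, score: float = TARGET_SCORE) -> float:
    """Cost of one SPECint_rate point for a system at the given price."""
    return price_usd / score


def relative_cost(name: str, baseline: str = "Charlie Special") -> float:
    """How many times more expensive `name` is than `baseline`."""
    return PRICES_USD[name] / PRICES_USD[baseline]


if __name__ == "__main__":
    for name, price in PRICES_USD.items():
        print(f"{name}: ${dollars_per_point(price):.2f}/point, "
              f"{relative_cost(name):.1f}x the cheapest")
```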
66
Keifer (Cancelled)
To be released 2010/2011
8 nodes/chip
4 cores/node (32 cores)
4 threads/core (128 Chickens)
3MB cache per node (24MB total)
187KB cache per thread
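The totals on the Keifer slide follow from multiplying out the hierarchy. A short sketch of that arithmetic; the 187KB figure matches a decimal megabyte (1000KB), which is an inference from the numbers rather than something the slide states.

```python
NODES_PER_CHIP = 8
CORES_PER_NODE = 4
THREADS_PER_CORE = 4
CACHE_PER_NODE_MB = 3

cores = NODES_PER_CHIP * CORES_PER_NODE
threads = cores * THREADS_PER_CORE
total_cache_mb = NODES_PER_CHIP * CACHE_PER_NODE_MB


def cache_per_thread_kb(kb_per_mb: int = 1000) -> float:
    """Cache available to each hardware thread, in KB."""
    return total_cache_mb * kb_per_mb / threads


if __name__ == "__main__":
    print(f"{cores} cores, {threads} threads, {total_cache_mb}MB cache")
    print(f"{cache_per_thread_kb():.1f}KB per thread (decimal MB)")
    print(f"{cache_per_thread_kb(1024):.1f}KB per thread (binary MB)")
```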
67
Multi-core Challenges
Memory bandwidth
Cache collisions
Memory heavy threads
Taylor™ solution
  What’s the point of trying to improve multi-core performance, it’s all going to get worse anyway and we are all going to die…
Engineering solution
  Increase memory bandwidth
  Reduce cache collisions
  Create non-memory-intensive algorithms
68
Memory Bandwidth Massive caches Expensive
69
Memory Bandwidth
FB-DIMMs
  Serialised communication
  Higher latency
    Lower single thread performance
    Gamers hate it
  Fewer pins
70
Memory Bandwidth Non-Uniform Memory Architecture (NUMA)
Built-in memory controller Hypertransport
71
Chapter 8: Coping with Multi-Core Challenges
72
Nehalem
New Intel micro-architecture
  To be released in October this year
  Commercially called “Core i7”
4 Cores
  2 and 8 core versions next year
  2, 4 and 6 core versions at 32nm (2010)
SMT
  2 threads per core
10%-25% higher IPC
73
Nehalem cont.
Built-in memory controllers
  2 or 3 channel DDR3
  1366 pins
QuickPath Interconnect
  Hypertransport equivalent
  17GB/s memory bandwidth
3 level caching
  L1: 32KB instruction, 32KB data per core
  L2: 256KB per core, private
  L3: 2MB per core, shared but “clever”
  3MB per core on 8 core model
74
Nehalem
75
Sandy Bridge
To be released in 2010
Targeting low power and high speed links
4, 6 and 8 cores
SMT: 2 threads per core
3 levels of caching
  L1: 32K, L2: 512K, L3: 2MB (per core)
  3MB L3 for the 8 core
Dynamic Turbo
  Exceed TDP
  Turn off all other cores and run 37% faster for 1 minute
GDDR block
  64 GB/s (~10x faster than current DDR2)
22nm shrink in 2011
76
Haswell
To be released in 2012
Up to 8 cores
Targets cache design and power
Few details
On chip GPU/Vector units
  Offload computing onto vector units
77
Chapter 9: Fusion
78
Graphics Processors
CPU -> Vertex Processor -> Vertex Shader -> Pixel Shader -> Frame Buffer
79
Graphics Processors
[Diagram: the CPU passes (X,Y,Z) vertices to several vertex processors in parallel, followed by parallel vertex shaders and pixel shaders writing to the frame buffer]
80
Shaders
[Diagram: a row of vertex shaders feeding a larger array of pixel shaders]
SIMD Code
If (!At_Edge) { Draw() } else { Antialiased() }
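A branch like the one on this slide is awkward for SIMD hardware because every lane in a group shares one instruction stream. The sketch below models one common way of handling it, masked (predicated) execution: when lanes disagree, both paths are issued and a mask decides which lanes commit a result. The function names mirror the slide; the lane model and the string results are illustrative assumptions and do not describe any specific GPU.

```python
from typing import List, Tuple


def draw(pixel: int) -> str:
    return f"draw({pixel})"


def antialiased(pixel: int) -> str:
    return f"antialiased({pixel})"


def run_simd_group(pixels: List[int], at_edge: List[bool]) -> Tuple[List[str], int]:
    """Run the branch for one SIMD group.

    Returns the per-lane results and the number of passes issued.
    """
    results = [""] * len(pixels)
    passes = 0
    if not all(at_edge):  # at least one lane takes the Draw() path
        passes += 1
        for lane, pixel in enumerate(pixels):
            if not at_edge[lane]:  # mask: only non-edge lanes commit
                results[lane] = draw(pixel)
    if any(at_edge):  # at least one lane takes the Antialiased() path
        passes += 1
        for lane, pixel in enumerate(pixels):
            if at_edge[lane]:  # mask: only edge lanes commit
                results[lane] = antialiased(pixel)
    return results, passes


if __name__ == "__main__":
    print(run_simd_group([0, 1, 2, 3], [False, False, False, False]))
    print(run_simd_group([0, 1, 2, 3], [False, True, False, False]))
```

A group whose lanes all agree issues one pass; a single edge pixel in the group makes every lane wait for both.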
81
Unified Shaders
82
Unified Shaders
83
GeForce 7800
84
GeForce 8800
85
Larrabee
86
Tegra
87
Conclusions
Single thread performance increases have ceased
  Each improvement is more expensive and less productive than the last
  Only way up is going parallel
Each generation deals with its current problems
  Currently memory is “free” but don’t use it
Multiprocessor architectures still have challenges
The 1000 core machine is both dead and alive
  Next year 8 cores
  In 2013 still 8 cores
Move towards heterogeneous processors
  Sub processors not for general computing
  Small partitioned computation problems only
88
Thank you
89
Answers
Where is computer architecture going?
  Increase of core numbers, fusion of graphics, physics and general purpose processing
Which processors failed and why?
  Pentium 4: clock got too fast
Can I have 1,000,000 i4004s in a chip?
  No, the interconnect would be larger than the processors, never mind the cache
How fast would a Pentium1 go on current technology?
  1.6GHz, 10 times slower than a Core2
90
Answers cont.
Can I have 32 of those on a chip?
  You can, but it will go at the same speed as a 3.2 core Core2 and will need a ton of cache
What will I have in my desktop machine in 2013?
  “Haswell”, 4.5 billion transistor 8.8GHz 8 core with 64MB L3 cache and 256 sub-processors
How can I program my GPU?
  Read the book, it’s interesting
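The "3.2 core Core2" answer follows from the earlier figure that a Pentium 1 on current technology would be 10 times slower than a Core2. A one-function sketch of that arithmetic, assuming perfectly parallel work and no memory bottleneck (the function name is hypothetical):

```python
def core2_equivalents(num_cores: int, slowdown_vs_core2: float) -> float:
    """Aggregate throughput in Core2-core units, assuming perfect scaling."""
    if slowdown_vs_core2 <= 0:
        raise ValueError("slowdown_vs_core2 must be positive")
    return num_cores / slowdown_vs_core2


# 32 Pentium 1 cores, each 10 times slower than a Core2 core.
print(core2_equivalents(32, 10.0))
```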