Charlie Brej APT Group University of Manchester


1 Group Talk
Charlie Brej, APT Group, University of Manchester

2 I Computers

3 Never Obsolete?

4 Want

5 Cool Forever

6 This Talk
History of microprocessors
  What did they do and why did they do it
  Nostalgia
The 1000 chicken computer
  What’s wrong with the idea
  Benchmarks of Artois
  Current challenges
Future designs
  New micro architectures
  Where are we heading?
88 slides

7 Single-Threaded Performance (MIPS)
Chart labels: Sandy Bridge, Core 2, Pentium 4, Nehalem, Pentium MMX, Pentium III, Pentium II, Pentium, Pentium Pro, i486, i386, i286, 8086, 8080, 4004

8 Chapter 1: Make it possible

9 4004: Computer on a chip!

10 4004
Year: 1971
2,300 transistors
46 instructions
16 registers of 4 bits each
4 KB memory access

11 1974: Memory was expensive * $6, today’s money

12 8080
Year: 1974
78 instructions
7 registers, 8 bits each
Stack operations
64 KB memory access

13 8086
Year: 1978
8 registers, 16 bits each
1MB memory access (segmented)
Many instructions but multi cycle:
  JMP 15 cycles
  MUL cycles
Jim could do better in software but using up RAM

14 Chapter 2: Make it useful

15 i286
Year: 1982
Protected mode
Linear MMU (Memory Protection Unit)
‘Brain dead chip’ (Bill Gates)

16 i386
Year: 1986
32-bit architecture (we still use today)
Virtual memory MMU
Performance becomes important
Real MUL/DIV, on-chip floating point

17 Chapter 3: Decrease CPI (Clocks Per Instruction)

18

19 i486
Year: 1989
Pipelining (5 stage)
Single cycle instructions
~4 times faster than an i386
Clock multiplier 2x and 3x
8 KB on-chip cache

20 Chapter 4: Increase IPC (Instructions Per Clock)

21 Pentium (P54C, P54CS and MMX)
Year: 1993
Superscalar (2 way)
MMX (basic SIMD ops)
6 stage pipeline

22 Instructions Per Clock
Pentium MMX with double L1 cache
Clock multiplied Pentium

23 Chapter 5: Improve Caches

24 Pentium Pro
Year: 1995
Speculative OOO execution
Ten stage pipeline
On-package L2 cache

25 Pentium Pro Versions 256KB 1024KB 512KB

26 Pentium II
Year: 1997
Off-chip L2:
  PII: 512KB half speed
  Xeon: 2MB full speed

27 Pentium III (Katmai) Year:1999 (Aug) Short life

28 Pentium III (Coppermine)
Year: 1999 (Oct)
On-chip 256K L2
Or 2MB for servers

29 Pentium III (Tualatin)
Year:2001

30 Pentium M (aka Core) Year:2004 Cache transistors: 66%

31 P6 Range stats
Columns: Pentium Pro, Pentium II, Pentium III (Katmai), Pentium III (Coppermine), Pentium III (Tualatin), Pentium M (Core)
Released: 11/1995 04/1998 02/1999 10/1999 07/2001 04/2004
Tech: 350nm 250nm 180nm 130nm 90nm
Transistors: 5.5M 7.5M 9.5M 28.1M 44M 140M
L1 Cache: 16KB 32KB 64KB
L2 Cache: 256KB 512KB 2048KB
Max. Clock: 200MHz 450MHz 600MHz 1000MHz 1400MHz 2333MHz
FSB speed: 66MHz 100MHz 133MHz 667MHz*
Clock Mul: 3x 4.5x 6x 7.5x 10.5x 13x
Cache trans.: 18% 27% 21% 46% 55% 66%
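
The "Transistors" and "Cache trans." rows each carry six values, one per processor, so the cache share can be turned into absolute counts. The sketch below does that split. It assumes the six values in each row line up with the six processors in the order listed; the function name `split_transistors` is hypothetical and not from the talk.

```python
# Values copied from the "P6 Range stats" slide. Assumption: both rows are in
# the same left-to-right order as the processor names.
CHIPS = [
    "Pentium Pro",
    "Pentium II",
    "Pentium III (Katmai)",
    "Pentium III (Coppermine)",
    "Pentium III (Tualatin)",
    "Pentium M (Core)",
]
TRANSISTORS_M = [5.5, 7.5, 9.5, 28.1, 44.0, 140.0]  # millions of transistors
CACHE_PERCENT = [18, 27, 21, 46, 55, 66]            # share spent on cache


def split_transistors(total_m, cache_percent):
    """Return (cache, logic) transistor counts in millions."""
    cache = total_m * cache_percent / 100
    return cache, total_m - cache


for name, total, pct in zip(CHIPS, TRANSISTORS_M, CACHE_PERCENT):
    cache, logic = split_transistors(total, pct)
    print(f"{name:26s} cache {cache:6.2f}M  logic {logic:6.2f}M")
```

Under that assumption, most of the growth from the Pentium Pro to the Pentium M goes into cache rather than logic, which is the trend the next slide plots.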

32 Cache transistors (%)

33 Chapter 6: Increase Clock Speed

34 Pentium 4 (Willamette) Year:2000
Super-pipelined: 31 stage pipeline (!)

35 Pentium 4 (Northwood) Year:2002 Hyper-Threading

36 Pentium 4 (Prescott) Year:2004 Never reached its potential

37 Gate Delays per Clock
Pentium: 40
Pentium III onward: 15.2
PIII (Katmai) and P4 (Willamette): insufficient time for tech ramp
Northwood: 6.5
Prescott: 8.7 (too hot)

38 Power consumption (Watts)

39

40 Chapter 7: Single-Chip Multiprocessors

41 Core2 (Conroe) Year:2006 Dual Core as standard Quad Core (dual die)

42 Core2 (Wolfdale) Year:2008 45nm shrink 3MB L2 cache per core

43 What next?

44 Single-Threaded Performance (MIPS)

45 Technology (nm)

46 Single-Threaded Performance (OPs/Gate delay)
Doubling every 3 years
Pentium 1 is 10x slower than a Core2 if constructed on same tech (over 200x on respective techs)
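
The two ratios on this slide can be separated into an architectural part and a process-technology part. The sketch below does that under the assumption that the two effects multiply; the slide does not state this, and the name `technology_factor` is hypothetical.

```python
def technology_factor(total_speedup, same_tech_speedup):
    """Share of the speedup attributed to process technology alone.

    Assumes total speedup = architecture speedup * technology speedup.
    """
    return total_speedup / same_tech_speedup


SAME_TECH_SPEEDUP = 10.0   # Core2 vs Pentium 1 on the same technology (slide)
TOTAL_SPEEDUP = 200.0      # "over 200x" on respective technologies (slide)

tech = technology_factor(TOTAL_SPEEDUP, SAME_TECH_SPEEDUP)
print(f"technology alone accounts for at least {tech:.0f}x")
```

Because the slide says "over 200x", the result is a lower bound on the technology contribution.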

47 Single-Threaded Performance (offset from doubling every 3 years)

48 Transistors (millions)
Pentium 1 is 250x smaller than a Core2 (Quad)

49 Are you ready to order? Price:£250 Price:£500 Price:£1000 Price:£2000
60% 40%

50 Are you ready to order? Price:£4000 Price:£2000

51 Which would you like?

52
I would like 1000 chicken fajitas please
Certainly sir. Would you like any drinks with that?
What drinks?

53 Memory Bandwidth Problems (Guesstimates based on SPEC2006 on a 2GHz Core2)
[Diagram: four CPUs, each with a 16KB L1, an L2 of 1MB, and 2GB of RAM. Bandwidth labels: 6GB/s sequenced, 1GB/s random, 40GB/s, 20GB/s, 10GB/s, 1GB/s, 0.4GB/s, 0.3GB/s, 0.2GB/s. Hit rate labels: 90%, 96%, 97%, 98%.]
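
The 0.4GB/s, 0.3GB/s and 0.2GB/s figures are consistent with a simple miss-traffic estimate: the bandwidth a cache passes down is the demand on it times its miss rate. The sketch below checks that reading. It assumes the 10GB/s label is the demand each core places on its L1 and that the 96%, 97% and 98% labels are L1 hit rates; the slide's diagram does not survive well enough to confirm this, and `miss_traffic_gb_s` is a hypothetical name.

```python
def miss_traffic_gb_s(demand_gb_s, hit_rate):
    """Traffic forwarded to the next level of the hierarchy, in GB/s."""
    return demand_gb_s * (1.0 - hit_rate)


L1_DEMAND_GB_S = 10.0  # assumed per-core demand on the L1, from the slide label

for hit_rate in (0.96, 0.97, 0.98):
    traffic = miss_traffic_gb_s(L1_DEMAND_GB_S, hit_rate)
    print(f"L1 hit rate {hit_rate:.0%}: {traffic:.1f} GB/s goes on to the L2")
```

The same function applies one level down: whatever the cores send to the shared L2, the fraction that misses there has to be served by RAM.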

54 Intel White paper: Supra-linear Packet Processing Performance with Intel® Multi-core Processors
Experiment
  Snort intrusion detection app monitoring 175 TCP connections
  Memory intensive parts
Single execution core
  Throughput: 566 MB/s
  L2 Cache hit rate: 99%
Four execution cores
  Throughput: 543 MB/s
  L2 Cache hit rate: 86%
Solution: Run only one memory intensive thread at a time
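
To put the two throughput figures in the usual scaling terms, the sketch below computes the speedup and the parallel efficiency of the four-core run relative to the single-core run. The numbers are the ones on the slide; the function names are hypothetical.

```python
def speedup(single_core_mb_s, multi_core_mb_s):
    """Throughput of the multi-core run relative to the single-core run."""
    return multi_core_mb_s / single_core_mb_s


def parallel_efficiency(single_core_mb_s, multi_core_mb_s, n_cores):
    """Speedup divided by core count; 1.0 would be perfect linear scaling."""
    return speedup(single_core_mb_s, multi_core_mb_s) / n_cores


SINGLE_MB_S = 566.0  # one execution core (slide)
QUAD_MB_S = 543.0    # four execution cores (slide)

print(f"speedup:    {speedup(SINGLE_MB_S, QUAD_MB_S):.3f}")
print(f"efficiency: {parallel_efficiency(SINGLE_MB_S, QUAD_MB_S, 4):.3f}")
```

A speedup below 1.0 means the four-core run was slower in absolute terms, which matches the drop in L2 hit rate reported alongside it.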

55 Memory Bandwidth Solution (Guesstimates based on SPEC2006 on a 2GHz Core2)
[Diagram: four CPUs, each with a 16KB L1, L2 blocks of 4MB and 1MB, and 2GB of RAM. Bandwidth labels: 6GB/s sequenced, 1GB/s random, 40GB/s, 20GB/s, 10GB/s, 1GB/s, 0.4GB/s, 0.2GB/s. Hit rate labels: 90%, 96%, 98%.]

56 Core2 L2 Cache variants
Dual Core
  Celeron Dual Core: 512KB
  Pentium Dual Core: 1MB
  Core2 Duo (Allendale): 2MB
  Core2 Duo (Wolfdale-3M): 3MB
  Core2 Duo (Conroe): 4MB
  Core2 Duo (Wolfdale): 6MB
Quad Core
  Core2 (Yorkfield-6M): 2x 3MB
  Core2 (Kentsfield): 2x 4MB
  Core2 (Yorkfield): 2x 6MB

57 How much cache?
2 core: minimum 0.5MB/core
4 core: minimum 1.5MB/core
Extrapolate to 1000 cores
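
The slide does not say which extrapolation it has in mind. As one illustration, the sketch below draws a straight line through the two data points and reads it off at 1000 cores. The linear model and the name `extrapolate_linear` are assumptions made here, not the talk's method.

```python
def extrapolate_linear(p1, p2, x):
    """Value at x of the straight line through points p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (x - x1)


TWO_CORE = (2, 0.5)    # (cores, minimum MB of cache per core), from the slide
FOUR_CORE = (4, 1.5)

per_core_mb = extrapolate_linear(TWO_CORE, FOUR_CORE, 1000)
total_mb = per_core_mb * 1000

print(f"per core: {per_core_mb} MB, whole chip: {total_mb} MB")
```

Two points cannot justify any particular curve, so the result is best read as a sign of how quickly the requirement gets out of hand rather than as a prediction.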

58 Nonsensical Extrapolations
26MB = 99.9% hit rate?

59 What would you like with your drinks?

60 Performance per Transistor (operations per gate delay per million transistors)

61 UltraSPARC T2 (Niagara)
8 cores/chip
8 threads/core (64 threads)
2 ALUs per core
  One for each group of 4 threads
  Up to 16 instructions/cycle
1.2 and 1.4 GHz
4MB L2 cache
Four dual-channel FB-DIMM controllers

62 Computation Heavy Benchmark

63 PBzip (somewhat memory intensive)

64 Comparison
Sun SPARC Enterprise T5240
  2 x UltraSPARC T2 1.2GHz
  SPECint_rate2006 (127 threads)
  Peak: 134 Base: 115
Equivalent
  2 x Quad Core Xeon® X GHz
  SPECint_rate2006 (8 threads)
  Peak: 138 Base: 114

65 Price for SPECint_rate 134
Charlie Special: 2 x Quad Core Xeon X GHz + 32GB RAM, $3,718.00
Dell PowerEdge 2900 III: $6,546.00
Apple Mac Pro: 2 x 3.2GHz Quad-Core Intel Xeon + 32GB RAM, $13,499.00
Sun SPARC Enterprise T5240 Server: 2 x UltraSPARC T2 1.2 GHz + 32GB RAM, $31,995.00
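
Since the slide treats all four machines as delivering roughly the same SPECint_rate score, the prices can be compared directly. The sketch below expresses each price as a multiple of the cheapest option. The prices are the slide's; the dictionary keys are shortened labels and `price_ratios` is a hypothetical helper.

```python
PRICES_USD = {
    "Charlie Special": 3718.00,
    "Dell PowerEdge 2900 III": 6546.00,
    "Apple Mac Pro": 13499.00,
    "Sun SPARC Enterprise T5240": 31995.00,
}


def price_ratios(prices):
    """Map each system to its price divided by the cheapest price."""
    cheapest = min(prices.values())
    return {name: price / cheapest for name, price in prices.items()}


for name, ratio in price_ratios(PRICES_USD).items():
    print(f"{name:28s} {ratio:4.1f}x the cheapest")
```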

66 Keifer
To be released 2010/2011
8 nodes/chip
4 cores/node (32 cores)
4 threads/core (128 Chickens)
3MB cache per node (24MB total)
187KB cache per thread
Cancelled
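
The totals on this slide follow from the per-node figures. The sketch below reproduces them; it assumes decimal units (1MB = 1000KB) for the per-thread figure, since that is the convention that lands on the slide's 187KB.

```python
NODES = 8
CORES_PER_NODE = 4
THREADS_PER_CORE = 4
CACHE_PER_NODE_MB = 3

cores = NODES * CORES_PER_NODE            # total cores on the chip
threads = cores * THREADS_PER_CORE        # total hardware threads
total_cache_mb = NODES * CACHE_PER_NODE_MB

# Assumption: 1MB = 1000KB. With 1MB = 1024KB the result would be 192KB.
cache_per_thread_kb = total_cache_mb * 1000 / threads

print(cores, threads, total_cache_mb, cache_per_thread_kb)
```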

67 Multi-core Challenges
Memory bandwidth
Cache collisions
Memory heavy threads
Taylor™ solution
  What’s the point of trying to improve multi-core performance, it’s all gonna get worse anyway and we are all going to die…
Engineering solution
  Increase memory bandwidth
  Reduce cache collisions
  Create non-memory-intensive algorithms

68 Memory Bandwidth Massive caches Expensive

69 Memory Bandwidth
FB-DIMMs
  Serialised communication
  Higher latency
  Lower single thread performance
  Gamers hate it
  Fewer pins

70 Memory Bandwidth
Non-Uniform Memory Architecture (NUMA)
Built-in memory controller
HyperTransport

71 Chapter 8: Coping with Multi-Core Challenges

72 Nehalem
New Intel micro-architecture
  To be released in October this year
  Commercially called “Core i7”
4 Cores
  2 and 8 core versions next year
  2, 4 and 6 core versions at 32nm (2010)
SMT
  2 threads per core
10%-25% higher IPC

73 Nehalem cont.
Built-in memory controllers
  2 or 3 channel DDR3
1366 pins
QuickPath Interconnect
  HyperTransport equivalent
17GB/s memory bandwidth
3 level caching
  L1: 32KB instruction, 32KB data per core
  L2: 256KB per core, private
  L3: 2MB per core, shared but “clever”
  3MB per core on 8 core model

74 Nehalem

75 Sandy Bridge
To be released in 2010
Targeting low power and high speed links
4, 6 and 8 cores
SMT
  2 threads per core
3 levels of caching
  L1: 32K, L2: 512K, L3: 2MB (per core)
  3MB L3 for the 8 core
Dynamic Turbo
  Exceed TDP
  Turn off all other cores and run 37% faster for 1 minute
GDDR block
  64 GB/s (~10x faster than current DDR2)
22nm shrink in 2011

76 Haswell
To be released in 2012
Up to 8 cores
Targets cache design and power
Few details
On chip GPU/Vector units
  Offload computing onto vector units

77 Chapter 9: Fusion

78 Graphics Processors CPU Vertex Processor Vertex Shader Pixel Shader
Frame Buffer

79 Graphics Processors
[Diagram labels: CPU, (X,Y,Z), Vertex Processor, Vertex Shader, Pixel Shader, Frame Buffer]

80 Shaders
[Diagram: a row of Vertex Shaders and a row of 16 Pixel Shaders]
SIMD Code
If (!At_Edge) { Draw() } else { Antialiased() }
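
One common way SIMD hardware handles a branch like the one on this slide is predication: every lane evaluates both sides and a per-lane mask picks which result is kept, so a divergent group pays for both paths. Whether this is the point the slide was making is an assumption. The sketch below emulates the idea in plain Python; `draw`, `antialiased` and `simd_select` are hypothetical stand-ins, not a real shader API.

```python
def draw(pixel):
    """Stand-in for the cheap path taken away from an edge."""
    return ("plain", pixel)


def antialiased(pixel):
    """Stand-in for the more expensive path taken at an edge."""
    return ("smooth", pixel)


def simd_select(pixels, at_edge):
    """Emulate predicated execution across a group of SIMD lanes.

    Both branches are computed for every lane, then the mask selects
    one result per lane.
    """
    plain = [draw(p) for p in pixels]           # all lanes run the 'if' side
    smooth = [antialiased(p) for p in pixels]   # all lanes run the 'else' side
    return [s if edge else d for d, s, edge in zip(plain, smooth, at_edge)]


result = simd_select([1, 2, 3, 4], [False, True, False, False])
print(result)
```

In this emulation a single edge pixel in the group makes every lane pay for `antialiased` as well as `draw`.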

81 Unified Shaders

82 Unified Shaders

83 GeForce 7800

84 GeForce 8800

85 Larrabee

86 Tegra

87 Conclusions
Single thread performance increases have ceased
  Each improvement is more expensive and less productive than the last
  Only way up is going parallel
Each generation deals with its current problems
  Currently memory is “free” but don’t use it
Multiprocessor architectures still have challenges
The 1000 core machine is both dead and alive
  Next year 8 cores
  In 2013 still 8 cores
Move towards heterogeneous processors
  Sub processors not for general computing
  Small partitioned computation problems only

88 Thank you

89 Answers
Where is computer architecture going?
  Increase of core numbers, fusion of graphics, physics and general purpose processing
Which processors failed and why?
  Pentium 4: clock got too fast
Can I have 1,000,000 i4004s in a chip?
  No, the interconnect would be larger than the processors, never mind the cache
How fast would a Pentium 1 go on current technology?
  1.6GHz, 10 times slower than a Core2

90 Answers cont.
Can I have 32 of those on a chip?
  You can, but it will go at the same speed as a 3.2 core Core2 and will need a ton of cache
What will I have in my desktop machine in 2013?
  “Haswell”, 4.5 billion transistor 8.8GHz 8 core with 64MB L3 cache and 256 sub-processors
How can I program my GPU?
  Read the book, it’s interesting
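
The "3.2 core Core2" figure follows from the earlier answer that a Pentium 1 on current technology would be 10 times slower than a Core2. The sketch below shows the arithmetic; it assumes throughput adds up perfectly across cores, which is the best case, and `core2_equivalents` is a hypothetical name.

```python
def core2_equivalents(n_cores, slowdown_vs_core2):
    """Aggregate throughput of n slow cores, in units of one Core2 core.

    Assumes perfect scaling across cores (no memory or cache contention).
    """
    return n_cores / slowdown_vs_core2


print(core2_equivalents(32, 10))
```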

