1
A New Era in Processor Evolution
Dezső Sima, Spring 2008 (Ver. 2.2)
2
Foreword
Beginning with second-generation superscalars, the continuous, approximately 10-fold-per-decade increase in processor efficiency leveled off, for reasons shown in Chapter I. Designers responded by aggressively raising clock frequencies, at up to a 100-fold-per-decade rate, in order to sustain an approximately 100-fold-per-decade performance increase. Such rapid progress, however, inevitably encountered its limits due to declining processor efficiency, increasing dissipation, and skew in parallel buses, as shown in this chapter. As a consequence, a decade-long era of processor evolution, characterized by massively rising clock frequencies, ended in the last few years. The new era is heralded by multicore and multithreaded designs, as discussed in Chapters III and IV.
3
Contents
1. Processor performance
2. Efficiency of processors
3. Addressing the levelling off of processor efficiency
4. Aggressively raising clock frequency
5. The efficiency wall
6. The thermal wall
7. The skew wall
8. EPIC architectures/processors
9. The end of an era in processor evolution
4
1. Processor Performance
5
1.1. Introduction (1)
Absolute performance: number of successfully executed instructions/sec, or number of successfully executed operations/sec (SIMD), where fc: clock frequency, IPC: instructions/cycle, OPI: operations/instruction
Relative performance: relating the execution times of a benchmark program on the tested system to those on a reference system. E.g.: SPECint92, SPECint_base2000
6
1.1. Introduction (2)
In general purpose applications, with:
IPC: issued instructions per cycle
η: ratio of successfully executed to issued instructions (efficiency of the speculative execution)
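The formula on this slide did not survive in the transcript. Given the symbols it defines, one plausible reading is that absolute performance is the product of clock frequency, issue rate, and the fraction of issued instructions that complete successfully. The sketch below works under that assumption; the function name and the sample values are hypothetical.

```python
def absolute_performance(fc_hz: float, ipc_issued: float, eta: float) -> float:
    """Successfully executed instructions per second.

    Assumes P = fc * IPC * eta, an assumed reading of the slide's symbols:
    fc is the clock frequency, IPC the issued instructions per cycle, and
    eta the ratio of successfully executed to issued instructions.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta is a ratio and must lie in [0, 1]")
    return fc_hz * ipc_issued * eta


if __name__ == "__main__":
    # Hypothetical processor: 2 GHz, 1.5 issued instructions/cycle,
    # 80 % of issued instructions complete successfully.
    p = absolute_performance(fc_hz=2.0e9, ipc_issued=1.5, eta=0.8)
    print(f"{p / 1e9:.2f} G instructions/s")
```

Under this reading, raising fc only pays off if IPC and η do not fall at the same time, which is the theme of the later sections.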
7
1.1. Introduction (3)
In performance/efficiency studies:
Theoretical interpretation: Pa
Practical measurement: Pr
?
8
1.1. Introduction (4)
If the following were true:
In that case:
I: number of instructions in the application considered
9
1.1. Introduction (5) However:
Figure 1.1.: Runtime ratios of the component programs of SPECint2000 Source:
10
1.1. Introduction (6) When comparing the performance of two systems:
This estimation is useable in trend considerations.
11
1.1. Introduction (7) Comparing the efficiency of two systems:
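The comparison formula itself is not preserved in this transcript. A minimal sketch, assuming that efficiency is estimated as benchmark score per unit of clock frequency (the names and sample values below are hypothetical):

```python
def efficiency(score: float, fc_ghz: float) -> float:
    """Benchmark score per GHz; an assumed proxy for processor efficiency."""
    if fc_ghz <= 0:
        raise ValueError("clock frequency must be positive")
    return score / fc_ghz


def efficiency_ratio(score_a: float, fc_a_ghz: float,
                     score_b: float, fc_b_ghz: float) -> float:
    """How many times more efficient system A is than system B."""
    return efficiency(score_a, fc_a_ghz) / efficiency(score_b, fc_b_ghz)


if __name__ == "__main__":
    # Hypothetical systems: A scores 1000 at 2.0 GHz, B scores 1200 at 3.0 GHz.
    ratio = efficiency_ratio(1000.0, 2.0, 1200.0, 3.0)
    print(f"A is {ratio:.2f}x as efficient as B")
```

In this made-up case B is the faster system, yet A does more work per clock cycle.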
12
1.2. Evolution of processor performance (1)
Figure 1.2: Integer performance growth of Intel’s x86 processors (SPECint92 on a logarithmic scale vs. year, 1979 to 2005, from the 8088/5 to the P4 Prescott; growth of ~100×/10 years, levelling off with the late Pentium 4 models Northwood B, Prescott (1M) and Prescott (2M))
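A per-decade trend such as the ~100×/10 years noted for Figure 1.2 can be restated as a yearly growth factor. A minimal sketch, assuming constant compound growth; the function name is illustrative only:

```python
def annual_growth(fold_per_period: float, years: float = 10.0) -> float:
    """Constant yearly growth factor that compounds to `fold_per_period`
    over `years` years."""
    if fold_per_period <= 0 or years <= 0:
        raise ValueError("arguments must be positive")
    return fold_per_period ** (1.0 / years)


if __name__ == "__main__":
    for label, fold in [("~10x per decade", 10.0), ("~100x per decade", 100.0)]:
        factor = annual_growth(fold)
        print(f"{label}: x{factor:.3f} per year (+{(factor - 1) * 100:.1f} %)")
```

A 100-fold increase per decade corresponds to roughly 58 % growth per year, and a 10-fold increase to roughly 26 % per year.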
13
1.2. Evolution of processor performance (2)
Figure 1.3: Integer performance growth (in general - 1) Source: X86-64 Technology White Paper, AMD Inc., Sunnyvale, CA, 2000
14
1.2. Evolution of processor performance (3)
Figure 1.4: Integer performance growth (in general - 2) Source: F. Labonte, www-vlsi.stanford.edu/group/chart/specInf2000.pdf
15
2. Efficiency of processors
16
2.1. Introduction ?
17
2.2. Growth of processor efficiency (1)
Figure 2.1: Efficiency of Intel processors (vertical axis labelled "SPECint_base2000/", logarithmic scale, vs. year, 1978 to 2002; processors shown: 286, 386DX, 486DX, Pentium, Pentium Pro, Pentium II, Pentium III; growth of ~10×/10 years, levelling off with the 2nd generation superscalars)
18
2.2. Growth of processor efficiency (2)
Figure 2.2: Growth of processor performance/efficiency (in general) Source: J. Birnbaum, „Architecture at HP: Two decades of Innovation”, Microprocessor Forum, October 14, 1997.
19
2.3. Contribution of raising processor efficiency to the growth of processor performance (up to the 2nd generation of superscalars)
Up to the second generation, raising the clock frequency and raising the efficiency contributed in equal proportion to the growth of performance.
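One way to make "equal proportion" concrete is to treat performance growth as the product of clock-frequency growth and efficiency growth, and to compare the factors on a logarithmic scale. This is a sketch under that assumption, not the author's stated method; the function name is hypothetical.

```python
import math


def contribution_shares(fc_growth: float, efficiency_growth: float) -> tuple[float, float]:
    """Split total performance growth (fc_growth * efficiency_growth)
    into logarithmic shares for frequency and efficiency."""
    total = fc_growth * efficiency_growth
    if fc_growth <= 0 or efficiency_growth <= 0 or total == 1.0:
        raise ValueError("growth factors must be positive and not cancel out")
    log_total = math.log(total)
    return math.log(fc_growth) / log_total, math.log(efficiency_growth) / log_total


if __name__ == "__main__":
    # ~10x/decade from frequency and ~10x/decade from efficiency
    # give ~100x/decade in performance, shared equally.
    fc_share, eff_share = contribution_shares(10.0, 10.0)
    print(f"frequency: {fc_share:.0%}, efficiency: {eff_share:.0%}")
```

Once efficiency stops growing (factor 1), the whole 100× per decade has to come from the clock frequency, which is the situation described in the foreword.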
20
2.4. Sources of raising processor efficiency
Increasing the word length: 8/16 to 32 bit (286 to 386DX)
Introducing and increasing temporal parallelism: 1st and 2nd generation pipelined processors (386DX, 486DX)
Introducing and increasing issue parallelism: 1st and 2nd generation superscalars (Pentium, Pentium Pro)
21
2.5. Limit of raising processor efficiency (1)
2nd generation superscalars (wide superscalars): processing width of 4 RISC instructions/cycle, or ~3 CISC instructions/cycle
Figure 2.3: Processing width of 2nd generation (wide) superscalars vs. extent of parallelism available in general purpose applications
Source: Wall: Limits of ILP, WRL TN-15, Dec. 1990
22
2.5. Limit of raising processor efficiency (2)
Figure 2.4: Growth of processor efficiency (in general)
23
2.5. Limit of raising processor efficiency (3)
In general purpose applications:
The width of 2nd generation superscalars already approaches the extent of available parallelism (ILP).
Beginning with 2nd generation (wide) superscalars, the sources of extensively raising processor efficiency became exhausted.
24
3. Addressing the levelling off of processor efficiency
25
Aggressively raising clock frequency (Sections 4 – 7): main road of evolution
Essentially widening the core by introducing EPIC architectures (Section 8)
26
4. Aggressively raising clock frequency
27
4.1. Sources of raising clock frequencies (1)
Raising clock frequency:
By scaling down the feature size in the manufacturing process
By reducing the logic depth of pipeline stages
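Both levers can be seen in a simple timing model: the clock period is roughly the logic depth of the slowest pipeline stage (in FO4 gate delays) times the delay of one FO4 gate, and the FO4 delay shrinks with the feature size. A minimal sketch under that simplified model; the function name and the numbers are hypothetical and ignore latch overhead and clock skew:

```python
def clock_frequency_hz(logic_depth_fo4: float, fo4_delay_s: float) -> float:
    """Clock frequency for a stage of `logic_depth_fo4` FO4 gate delays,
    where one FO4 delay lasts `fo4_delay_s` seconds (simplified model)."""
    if logic_depth_fo4 <= 0 or fo4_delay_s <= 0:
        raise ValueError("arguments must be positive")
    return 1.0 / (logic_depth_fo4 * fo4_delay_s)


if __name__ == "__main__":
    # Hypothetical process with a 25 ps FO4 delay.
    for depth in (40, 20, 10):
        f = clock_frequency_hz(depth, 25e-12)
        print(f"{depth:>2} FO4 per stage -> {f / 1e9:.1f} GHz")
    # A finer process (hypothetical 12.5 ps FO4 delay) at the same depth.
    print(f"20 FO4 at 12.5 ps -> {clock_frequency_hz(20, 12.5e-12) / 1e9:.1f} GHz")
```

Halving either the logic depth per stage or the FO4 delay doubles the attainable clock frequency in this model.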
28
4.1. Sources of raising clock frequencies (2)
Figure 4.1: Evolution of Intel’s process technology Source: D. Bhandarkar: „The Dawn of a New Era”, 11. EMEA, May, 2006.
29
4.1. Sources of raising clock frequencies (3)
Figure 4.2: Number of pipeline stages in Intel’s and AMD’s processors (1990 to 2005): Pentium (5), Pentium Pro (~12), Pentium 4 (~20), P4 Prescott (~30), Conroe (14), K6 (6), Athlon-64 (12); Athlon and Core Duo are also shown
30
4.1. Sources of raising clock frequencies (4)
Figure 4.3: Max. logic depth of pipeline stages in processors (in terms of FO4) Source: F. Labonte www-vlsi.stanford.edu/group/chart/CycleFO4.pdf
31
4.2. Growth rate of clock frequencies (1)
Figure 4.4: Growth of clock frequencies in Intel’s x86 line of processors
32
4.2. Growth rate of clock frequencies (2)
Figure 4.5: Growth of clock frequencies (in general)
33
4.3. Implications of aggressively raising clock frequencies
4.3.1 Overview Ousting of major RISC families (4.3.2) Emerging limits of evolution (4.3.3)
34
4.3.2. Ousting of major RISC families (1)
Figure 4.6: The shift in performance leadership between RISC and x86 lines
35
4.3.2. Ousting of major RISC families (2)
CISCs overtook the performance leadership, since it is harder to raise fc at the same rate from a higher value than from a lower one
1997: Intel and HP unveiled IA-64/Merced as the next generation architecture/processor line
Cancellation of most major RISC lines, such as MIPS’s R lines, HP’s Alpha and PA lines, and the PowerPC Consortium’s PowerPC line
36
4.3.3. Emerging limits of evolution
The efficiency wall (Section 5) The thermal wall (Section 6) The skew wall (Section 7)
37
5. The efficiency wall
38
5.1. Overview Basic reason:
speed gap between the processor and the memory (it widens on higher frequencies)
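The widening can be illustrated by expressing a fixed memory latency in processor clock cycles. A minimal sketch; the 50 ns latency and the clock frequencies are hypothetical sample values:

```python
def latency_in_cycles(latency_ns: float, fc_ghz: float) -> float:
    """Number of processor clock cycles that elapse during `latency_ns`
    at a clock frequency of `fc_ghz` (1 GHz = 1 cycle per ns)."""
    if latency_ns < 0 or fc_ghz <= 0:
        raise ValueError("latency must be non-negative and frequency positive")
    return latency_ns * fc_ghz


if __name__ == "__main__":
    dram_latency_ns = 50.0  # hypothetical, held constant across processors
    for fc in (0.5, 1.0, 2.0, 3.0):
        cycles = latency_in_cycles(dram_latency_ns, fc)
        print(f"{fc:.1f} GHz: {cycles:.0f} cycles lost per memory access")
```

With memory latency unchanged, a sixfold higher clock frequency makes each access six times more expensive in cycles.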
39
5.1. Overview (2)
Main appearances of the speed gap between the processor and the memory:
DRAM latencies
Memory transfer rates
L2 cache latencies
Transfer rates of processor buses
40
5.2. Speed gap between processor and memory (1a)
Figure 5.1a: DRAM types
41
5.2. Speed gap between processor and memory (1b)
Figure 5.1b: Latency of DRAM chips
42
5.2. Speed gap between processor and memory (1c)
Figure 5.1c: System-level memory latency in x86-based PCs
43
5.2. Speed gap between processor and memory (1d)
Figure 5.1d: Latency of DRAM chips (in clock cycles)
44
5.2. Speed gap between processor and memory (2)
Figure 5.2: Relative transfer rate of memories (D: dual channel)
45
5.2. Speed gap between processor and memory (3)
Core          fc max at intro. (GHz)   L2 size (Kbyte)   L2 latency (clock cycles)
Willamette    1.5                      128               7
Northwood     2.0                      512               16
Prescott      3.4                      1024              23
Figure 5.3: Latency of L2 caches
46
5.2. Speed gap between processor and memory (4)
Figure 5.4: Relative transfer rates of processor buses
47
5.3. Efficiency of 3rd generation superscalars (1)
Figure 5.5: Efficiency of Intel’s Pentium III and Pentium 4 processors in general purpose applications
48
5.3. Efficiency of 3rd generation superscalars (2)
Figure 5.6: Efficiency of AMD’s Athlon, Athlon XP and Athlon 64 processors in general purpose applications
49
5.3. Efficiency of 3rd generation superscalars (3)
Figure 5.7: Main aspects of the memory subsystem affecting core efficiency
50
5.3. Efficiency of 3rd generation superscalars (4)
Figure 5.8: Contrasting the efficiency of Intel’s and AMD’s processors
51
5.3. Efficiency of 3rd generation superscalars (5)
Figure 5.9: Contrasting Intel’s and AMD’s processor design philosophies
52
5.3. Efficiency of 3rd generation superscalars (6)
Implication of the emerging efficiency wall: Diminishing return on higher clock frequencies
53
6. The thermal wall
54
6. The thermal wall (1)
Dissipation (D):
Dynamic: $D_d = A \cdot C \cdot V^2 \cdot f_c$
Static: $D_s = V \cdot I_{leak}$
with
A: ratio of the active gates
C: effective capacitance of the gates
V: supply voltage
fc: clock frequency
Ileak: leakage current
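The two dissipation terms can be evaluated directly. A minimal sketch of the formulas above; the function names and all numeric values are hypothetical and chosen only to give plausible magnitudes:

```python
def dynamic_power_w(a: float, c_farad: float, v_volt: float, fc_hz: float) -> float:
    """Dynamic dissipation Dd = A * C * V^2 * fc, in watts."""
    return a * c_farad * v_volt ** 2 * fc_hz


def static_power_w(v_volt: float, i_leak_amp: float) -> float:
    """Static dissipation Ds = V * Ileak, in watts."""
    return v_volt * i_leak_amp


if __name__ == "__main__":
    # Hypothetical chip: 10 % of gates active, 100 nF effective capacitance,
    # 1.2 V supply, 3 GHz clock, 20 A leakage current.
    dd = dynamic_power_w(a=0.1, c_farad=100e-9, v_volt=1.2, fc_hz=3.0e9)
    ds = static_power_w(v_volt=1.2, i_leak_amp=20.0)
    print(f"dynamic: {dd:.1f} W, static: {ds:.1f} W, total: {dd + ds:.1f} W")
```

Dynamic dissipation grows linearly with fc and quadratically with V, so a frequency increase that also needs a higher supply voltage is penalized on both counts.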
55
6. The thermal wall (2) Figure 6.1:Chip dynamic and static power dissipation trends Source: N. S. Kim et al., „Leakage Current: Moore’s Law Meets Static Power”, Computer, Dec. 2003, pp
56
Figure 6.2: Dynamic and static power dissipation trends
Source:Solie D., „Technology Trends, Aug. 2006, file14+-+darryl+solie+-+ibm+power+symposium+presentation/$file/14+-+darryl+solie-ibm-power+symposium+presentation+v2.pdf
57
6. The thermal wall (3) Figure 6.3: Relative dissipation of Intel’s x86 family of processors
58
6. The thermal wall (4) Figure 6.4: Contrasting the evolution of Intel’s and AMD’s processor lines with the thermal wall
59
6. The thermal wall (5) Figure 6.5: Intel’s P4 processor family (Netburst architecture)
60
6. The thermal wall (6) Figure 6.6: The growth of relative dissipation of processors (in general) Source: R Hetherington, „The UltraSPARC T1 Processor” White Paper, Sun Inc., 2005
61
6. The thermal wall (7) Implications of the thermal wall:
The approach of increasing performance by aggressively raising clock frequency met the thermal wall.
Processor designs now focus more and more on power-aware techniques.
62
7. The skew wall
63
7. The skew wall (1) Reason: Figure 7.1: Skew between lines of parallel buses
64
7. The skew wall (2) Figure 7.2: Equalizing skews among different bit lines of the processor bus on the MSI 915G Combo motherboard
65
7. The skew wall (3)
Implication of emerging skews between bit lines of parallel buses: introduction of sequential (serial) buses, also for slow peripheral buses, due to impressive cost savings
Figure 7.3: Signal transfer over a sequential bus
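Why skew becomes critical at higher transfer rates can be seen by relating a fixed skew between bit lines to the duration of one bit. A minimal sketch; the function names, the 0.25 ns skew, and the transfer rates are hypothetical sample values:

```python
def bit_time_ns(transfer_rate_mtps: float) -> float:
    """Duration of one bit on a bus line in ns, for a rate given in
    mega-transfers per second."""
    if transfer_rate_mtps <= 0:
        raise ValueError("transfer rate must be positive")
    return 1000.0 / transfer_rate_mtps


def skew_fraction(skew_ns: float, transfer_rate_mtps: float) -> float:
    """Share of the bit time consumed by a skew of `skew_ns` between lines."""
    return skew_ns / bit_time_ns(transfer_rate_mtps)


if __name__ == "__main__":
    skew_ns = 0.25  # hypothetical skew between two lines of a parallel bus
    for rate in (100.0, 400.0, 800.0):
        share = skew_fraction(skew_ns, rate)
        print(f"{rate:.0f} MT/s: bit time {bit_time_ns(rate):.2f} ns, "
              f"skew uses {share:.1%} of it")
```

The same physical skew that is negligible on a slow bus eats a substantial part of the sampling window on a fast one. A serial link avoids the problem because there are no parallel bit lines to align.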
66
Implication of emerging limits of evolution
The approach of aggressively raising clock frequencies met the efficiency, thermal and skew walls and thus hit a dead end
67
8. EPIC architectures/processors
68
8. EPIC architectures/processors (1)
Aggressively raising clock frequency (Sections 4 – 7): main road of evolution
Essentially widening the core by introducing EPIC architectures (Section 8)
69
8. EPIC architectures/processors (2)
Figure 8.1: Contrasting the principles of operation of superscalar and VLIW processors
Principle of superscalar processing: dependent instructions are supplied to the processor (F, E), which performs dynamic dependency resolution
Principle of VLIW (Very Long Instruction Word) processing: independent instructions are supplied to the processor (F, E); dependencies are resolved statically
70
8. EPIC architectures/processors (3)
From VLIW to EPIC
EPIC: Explicitly Parallel Instruction Computing, an enhanced VLIW (integration of advanced superscalar features): branch prediction, explicit cache control
1994: Intel, HP
1997: EPIC designation
2001: IA-64 Itanium
71
8. EPIC architectures/processors (4)
Figure 8.2: Overview of Itanium cores
72
8. EPIC architectures/processors (5)
Figure 8.3: The efficiency of Itanium processors
73
8. EPIC architectures/processors (6)
Figure 8.4: Expected spreading of the IA-64 architecture (Itanium processors) Source: L. Gwennap: Intel’s Itanium and IA-64: Technology and Market Forecast, MDR, 2000
74
8. EPIC architectures/processors (7)
Figure 8.5: Revenue expectations concerning Intel’s Itanium line
75
8. EPIC architectures/processors (8)
In general purpose applications: EPIC architectures/processors play a decreasing role
76
9. The end of an era in processor evolution
77
9. The end of an era in processor evolution (1)
In general purpose applications, processor efficiency leveled off beginning with the 2nd generation superscalars. Both approaches to addressing the leveling off of efficiency met limits of evolution and thus hit a dead end.
Single-core complex superscalars are at the end of an era.
78
9. The end of an era in processor evolution (2)
Available hardware complexity continues to increase exponentially (Moore’s law): complexity doubles about every 24 months.
A new era in processor evolution: the dawn of multicore, multithreaded processors.
The number of processors will also double about every 24 months.
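The doubling rule can be turned into a simple projection. A minimal sketch, assuming a constant doubling period of 24 months; the function name and the starting value of 2 are hypothetical:

```python
def projected_count(initial: float, months: float,
                    doubling_period_months: float = 24.0) -> float:
    """Value after `months`, if it doubles every `doubling_period_months`."""
    if doubling_period_months <= 0:
        raise ValueError("doubling period must be positive")
    return initial * 2.0 ** (months / doubling_period_months)


if __name__ == "__main__":
    # Hypothetical starting point: 2 processors (cores) per chip.
    for years in (0, 2, 4, 6, 8):
        n = projected_count(2, years * 12)
        print(f"after {years} years: {n:.0f}")
```

Under this assumption the count grows eightfold every six years, the same pace at which Moore's law supplies hardware complexity.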
79
9. The end of an era in processor evolution (3)
Figure 9.1: Rapid spreading of multi core processors revealed by Intel