1. Evolution of ILP-processing
Dezső Sima, Fall 2006
Structure
1. Paradigms of ILP-processing
2. Introduction of temporal parallelism
3. Introduction of issue parallelism
3.1. VLIW processing
3.2. Superscalar processing
4. Introduction of data parallelism
5. The main road of evolution
6. Outlook
1. Paradigms of ILP-processing
1.1. Introduction (1)
Figure 1.1: Evolution of computer classes (timeline of mainframes, minicomputers, microcomputers and supercomputers, 1950-2000; e.g. UNIVAC, /360, /370, /390, z/900; PDP-8, PDP-11, VAX; 4004, 8080, 80286 … Pentium 4; ENIAC, CDC-6600, Cray-1 … Cray T3E)
1.1. Introduction (2)
Figure 1.2: The integer performance of Intel's x86 line of processors
1.2. Paradigms of ILP-processing (1)
- Temporal parallelism, static dependency resolution: pipeline processors
- Issue parallelism, static dependency resolution: VLIW processors
VLIW processing
VLIW: Very Long Instruction Word
The processor fetches (F) and executes (E) instructions that the compiler has already made independent (static dependency resolution).
1.2. Paradigms of ILP-processing (1)
- Temporal parallelism, static dependency resolution: pipeline processors
- Issue parallelism, static dependency resolution: VLIW processors
- Issue parallelism, dynamic dependency resolution: superscalar processors
Superscalar processing
- VLIW processing: the processor fetches and executes independent instructions (static dependency resolution).
- Superscalar processing: the processor fetches possibly dependent instructions and resolves the dependencies itself (dynamic dependency resolution).
1.2. Paradigms of ILP-processing (1)
- Temporal parallelism, static dependency resolution: pipeline processors
- Issue parallelism, static dependency resolution: VLIW processors
- Issue parallelism, dynamic dependency resolution: superscalar processors
- Data parallelism: SIMD extension
1.2. Paradigms of ILP-processing (2)
Figure 1.3: The emergence of ILP-paradigms and processor types (sequential processing → pipeline processors, ~'85 → VLIW/EPIC and superscalar processors, ~'90 → superscalar processors with SIMD extension, ~'95-'00)
1.3. Performance potential of ILP-processors (1)
Absolute performance grows (in the ideal as well as the real case) with each step: sequential → pipeline → VLIW/superscalar → SIMD extension.
1.3. Performance potential of ILP-processors (2)
Performance components of ILP-processors:
- Clock frequency: depends on technology and the microarchitecture
- Per-cycle efficiency: temporal, issue and data parallelism plus the efficiency of speculative execution; depends on the ISA, microarchitecture, system architecture, OS, compiler and application
2. Introduction of temporal parallelism
2.1. Introduction (1)
Types of temporal parallelism in ILP processors: overlapping all phases of instruction processing (F: fetch cycle, D: decode cycle, E: execute cycle, W: write cycle) in pipeline processors, e.g. Atlas (1963), IBM 360/91 (1967).
Figure 2.1: Implementation alternatives of temporal parallelism
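As a back-of-the-envelope illustration (not from the slides), the gain from overlapping the F/D/E/W phases can be expressed as a simple cycle count:

```python
def cycles_sequential(n_instr, n_stages=4):
    # Without overlap, every instruction occupies the processor
    # for all of its stages before the next one may start.
    return n_instr * n_stages

def cycles_pipelined(n_instr, n_stages=4):
    # With an ideal pipeline the first instruction needs n_stages
    # cycles; every further instruction completes one cycle later.
    return n_stages + (n_instr - 1)

n = 100
print(cycles_sequential(n))  # 400
print(cycles_pipelined(n))   # 103
```

For long instruction streams the pipelined cycle count approaches n, i.e. an ideal speedup close to the number of stages.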
2.1. Introduction (2)
Figure 2.2: The appearance of pipeline processors, 1980-1992 (x86: 80286, 80386, 80486; M68000: 68020, 68030, 68040; MIPS R: R2000, R3000, R6000, R4000)
2.2. Processing bottlenecks evoked and their resolution
Overview:
- The scarcity of memory bandwidth (2.2.2)
- The problem of branch processing (2.2.3)
2.2.2. The scarcity of memory bandwidth (1)
Moving from sequential to pipeline processing means that more instructions and data must be fetched per cycle, which demands larger memory bandwidth.
2.2.2. The scarcity of memory bandwidth (2)
Figure 2.3: Introduction of caches in pipeline (scalar) processors, 1980-1992. C(n) denotes a universal cache of n kB, C(n/m) split instruction/data caches of n and m kB; e.g. 80486: C(8), 68040: C(4,4), R4000: C(8,8).
2.2.3. The problem of branch processing (1)
E.g. in the case of conditional branches: on a 4-stage pipeline, condition checking and branch address calculation happen only in the E stage, so the sequentially fetched instructions ii+1, ii+2 must be discarded on a taken branch and the branch target instruction (bti) starts fetching only after a multi-cycle bubble.
Figure 2.4: Processing of a conditional branch on a 4-stage pipeline
2.2.3. The problem of branch processing (2)
Figure 2.5: Principle of branch prediction in the case of a conditional branch: at each conditional branch ending a basic block the processor guesses a path and continues fetching along it speculatively until the branch outcome is approved.
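The guessed-path idea can be sketched with a textbook two-bit saturating-counter predictor; this is a generic illustration, not necessarily the scheme used by any of the processors named in the figures:

```python
class TwoBitPredictor:
    """Classic 2-bit saturating counter: states 0-1 predict
    not-taken, states 2-3 predict taken."""
    def __init__(self):
        self.state = 2  # start in "weakly taken"

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturate at the ends so one atypical outcome
        # does not flip an established prediction.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch: taken 9 times, then not taken on loop exit.
p = TwoBitPredictor()
outcomes = [True] * 9 + [False]
hits = sum(1 for taken in outcomes
           if p.predict() == taken or p.update(taken))
```

The one-liner above conflates predict and update; written out plainly, the predictor is right on the 9 taken iterations and wrong only on the exit, i.e. 9 of 10 correct for this pattern.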
2.2.3. The problem of branch processing (3)
Figure 2.6: Introduction of branch prediction (speculative execution of branches) in (scalar) pipeline processors, 1980-1992
2.3. Generations of pipeline processors (1)
- 1. generation pipelined: no cache, no speculative branch processing
- 1.5. generation pipelined: cache, no speculative branch processing
- 2. generation pipelined: cache, speculative branch processing
2.3. Generations of pipeline processors (2)
Figure 2.7: Generations of pipeline processors, 1980-1992: 1. generation (no cache, no speculative branch processing), 1.5. generation (cache, no speculative branch processing) and 2. generation (cache, speculative branch processing) members of the x86 (80286, 80386, 80486), M68000 (68020, 68030, 68040) and MIPS R (R2000, R3000, R6000, R4000) lines
2.4. Exhausting the available temporal parallelism
2. generation pipeline processors already exhaust the available temporal parallelism.
3. Introduction of issue parallelism
3.1. Options to implement issue parallelism
Starting from pipeline processing, issue parallelism can be implemented by:
- VLIW (EPIC) instruction issue with static dependency resolution (3.2), or
- superscalar instruction issue with dynamic dependency resolution (3.3)
3.2. VLIW processing (1)
Figure 3.1: Principle of VLIW processing: a VLIW processor with ~10-30 execution units (EUs) fetches VLIW instructions from memory/cache whose sub-instructions are independent (static dependency resolution) and dispatches them to the EUs in parallel.
3.2. VLIW processing (2)
VLIW: Very Long Instruction Word (term coined by Fisher, 1983)
- Length of sub-instructions: ~32 bits
- Instruction length: ~n*32 bits, where n is the number of execution units (EUs)
- Static dependency resolution with parallel optimization requires a complex VLIW compiler
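The contract between the VLIW compiler and the hardware can be sketched as follows; the register names and the bundle layout are invented for illustration, the point is only that all sub-instructions of a bundle start in the same cycle because the compiler has guaranteed their independence:

```python
import operator

def execute_bundle(regs, bundle):
    # Read all operands first: sub-instructions within one bundle
    # must not depend on each other's results, so all reads can
    # happen before any write (as if on parallel EUs).
    reads = [(op, regs[a], regs[b]) for (op, a, b, _dst) in bundle]
    for (op, x, y), (_, _, _, dst) in zip(reads, bundle):
        regs[dst] = op(x, y)
    return regs

regs = {"r1": 2, "r2": 3, "r3": 10, "r4": 4, "r5": 0, "r6": 0}
bundle = [(operator.add, "r1", "r2", "r5"),   # EU0: r5 = r1 + r2
          (operator.mul, "r3", "r4", "r6")]   # EU1: r6 = r3 * r4
execute_bundle(regs, bundle)
print(regs["r5"], regs["r6"])  # 5 40
```

If the compiler cannot fill all n slots with independent work, the remaining slots carry no-ops, which is exactly the memory-utilization drawback discussed below.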
3.2. VLIW processing (3)
Figure 3.2: Experimental and commercially available VLIW processors. Source: Sima et al., ACA, Addison-Wesley, 1997
3.2. VLIW processing (4)
Benefits of static dependency resolution:
- Less complex processors
- Earlier appearance
- Either higher fc or larger ILP
3.2. VLIW processing (5)
Drawbacks of static dependency resolution:
- A completely new ISA: new compilers and OS, rewriting of applications, and achieving the critical mass to convince the market
- The compiler uses technology-dependent parameters (e.g. latencies of EUs and caches, repetition rates of EUs) for dependency resolution and parallel optimization, so new processor models require new compiler versions
3.2. VLIW processing (6)
Drawbacks of static dependency resolution (cont.):
- VLIW instructions are only partially filled, so memory space and bandwidth are poorly utilized
3.2. VLIW processing (7)
Commercial VLIW processors:
- Trace (1987), Multiflow
- Cydra (1989), Cydrome
Within a few years both firms went bankrupt; their developers moved to HP and IBM, where they became initiators/developers of EPIC processors.
3.2. VLIW processing (8)
VLIW → EPIC: integration of SIMD instructions and advanced superscalar features
- 1994: Intel and HP announced their cooperation
- 1997: the term EPIC was born
- 2001: IA-64 Itanium
3.3. Superscalar processing
3.3.1. Introduction (1)
Main attributes of superscalar processing: pipeline processing with superscalar instruction issue, dynamic dependency resolution and a compatible ISA.
3.3.1. Introduction (2)
Figure 3.3: Experimental superscalar processors. Source: Sima et al., ACA, Addison-Wesley, 1997
3.3.1. Introduction (3)
Figure 3.4: Emergence of superscalar processors. Source: Sima et al., ACA, Addison-Wesley, 1997
3.3.2. Attributes of first generation superscalars (1)
- Width: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"
- Core: static branch prediction
- Cache: single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus
- Examples: Alpha 21064, PA 7100, Pentium
3.3.2. Attributes of first generation superscalars (2)
Consistency of processor features (1). Dynamic instruction frequencies in general purpose applications:
- FX instructions: ~40%
- Load instructions: ~30%
- Store instructions: ~10%
- Branches: ~20%
- FP instructions: ~1-5%
Available parallelism in general purpose applications assuming direct issue: ~2 instructions/cycle (Wall 1989; Lam, Wilson 1992). Source: Sima et al., ACA, Addison-Wesley, 1997
3.3.2. Attributes of first generation superscalars (3)
Consistency of processor features (2). Reasonable core width: 2-3 instructions/cycle.
Required number of data cache ports (np): np ~ 0.4 * (2-3) = 0.8-1.2 accesses/cycle → single-ported data caches.
Required EUs (each L/S instruction generates an address calculation as well):
- FX: ~0.8 * (2-3) = 1.6-2.4 → 2-3 FX EUs
- L/S: ~0.4 * (2-3) = 0.8-1.2 → 1 L/S EU
- Branch: ~0.2 * (2-3) = 0.4-0.6 → 1 B EU
- FP: ~(0.01-0.05) * (2-3) → 1 FP EU
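The consistency calculation above follows directly from the instruction mix; a small sketch of the arithmetic (the slide's own numbers, wrapped in code):

```python
# Dynamic instruction mix in general purpose applications
# (frequencies from the slide).
mix = {"fx": 0.40, "load": 0.30, "store": 0.10, "branch": 0.20}

def required_cache_ports(width):
    # Every load and store needs a data-cache port in its cycle.
    return (mix["load"] + mix["store"]) * width

def required_eus(width):
    # Each load/store also generates an address calculation,
    # so the FX demand is (0.40 + 0.40) * width.
    return {
        "fx":     (mix["fx"] + mix["load"] + mix["store"]) * width,
        "ls":     (mix["load"] + mix["store"]) * width,
        "branch": mix["branch"] * width,
    }

# A 2- to 3-wide core needs ~0.8-1.2 cache accesses per cycle,
# so a single-ported data cache suffices.
print(required_cache_ports(2), required_cache_ports(3))
print(required_eus(3))
```

The same functions reproduce the second generation figures later in the section simply by calling them with width 4-5.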
3.3.3. The bottleneck evoked and its resolution (1)
The issue bottleneck.
Figure 3.5: The principle of direct issue. (a) Simplified structure of the microarchitecture assuming direct issue; (b) the issue process.
3.3.3. The bottleneck evoked and its resolution (2)
Eliminating the issue bottleneck.
Figure 3.6: Principle of the buffered (out-of-order) issue
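The idea of buffered (out-of-order) issue can be sketched as follows; the instruction-window format here is invented for illustration, the point is that a blocked instruction no longer stalls the independent ones behind it:

```python
def issue_order(window, ready_regs):
    """Issue every instruction whose source operands are ready,
    repeatedly, until only truly blocked instructions remain."""
    issued = []
    pending = list(window)
    while pending:
        progress = False
        for instr in list(pending):
            name, srcs, dst = instr
            if all(s in ready_regs for s in srcs):
                issued.append(name)        # dispatch to an EU
                ready_regs.add(dst)        # its result becomes available
                pending.remove(instr)
                progress = True
        if not progress:
            break  # the rest wait on operands not yet produced
    return issued

# i1 waits on r9 (e.g. a pending cache miss); with direct issue
# it would block i2 and i3, with buffered issue they overtake it.
window = [("i1", {"r9"}, "r1"),
          ("i2", {"r2"}, "r3"),
          ("i3", {"r4"}, "r5")]
print(issue_order(window, {"r2", "r4"}))  # ['i2', 'i3']
```

With direct issue the same window would issue nothing at all until r9 arrived; this difference is exactly the bottleneck the buffered scheme removes.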
3.3.3. The bottleneck evoked and its resolution (3)
Moving from first generation (narrow) to second generation (wide) superscalars meant eliminating the issue bottleneck and, in addition, widening the processing width of all subsystems of the core.
3.3.4. Attributes of second generation superscalars (1)
First generation "narrow" superscalars vs. second generation "wide" superscalars:
- Width: 2-3 RISC or 2 CISC instructions/cycle → 4 RISC or 3 CISC instructions/cycle "wide"
- Core: static branch prediction → buffered (ooo) issue, predecoding, dynamic branch prediction, register renaming, ROB
- Caches: single-ported, blocking L1 data caches and off-chip L2 caches attached via the processor bus → dual-ported, non-blocking L1 data caches and directly attached off-chip L2 caches
- Examples: Alpha 21064 → Alpha 21264; PA 7100 → PA 8000; Pentium → Pentium Pro, K6
3.3.4. Attributes of second generation superscalars (2)
Consistency of processor features (1). Dynamic instruction frequencies in general purpose applications:
- FX instructions: ~40%
- Load instructions: ~30%
- Store instructions: ~10%
- Branches: ~20%
- FP instructions: ~1-5%
Available parallelism in general purpose applications assuming buffered issue: ~4-6 instructions/cycle (Wall 1990). Source: Sima et al., ACA, Addison-Wesley, 1997
Figure 3.7: Extent of parallelism available in general purpose applications assuming buffered issue. Source: Wall: Limits of ILP, WRL TN-15, Dec. 1990
3.3.4. Attributes of second generation superscalars (3)
Consistency of processor features (2). Reasonable core width: 4-5 instructions/cycle.
Required number of data cache ports (np): np ~ 0.4 * (4-5) = 1.6-2 accesses/cycle → dual-ported data caches.
Required EUs (each L/S instruction generates an address calculation as well):
- FX: ~0.8 * (4-5) = 3.2-4 → 3-4 FX EUs
- L/S: ~0.4 * (4-5) = 1.6-2 → 2 L/S EUs
- Branch: ~0.2 * (4-5) = 0.8-1 → 1 B EU
- FP: ~(0.01-0.05) * (4-5) → 1 FP EU
3.3.5. Exhausting the issue parallelism
In general purpose applications 2. generation ("wide") superscalars already exhaust the parallelism available at the instruction level.
4. Introduction of data parallelism
4.1. Overview (1)
Figure 4.1: Implementation alternatives of data parallelism
4.1. Overview (2)
SIMD instructions (FX/FP) perform multiple operations within a single instruction.
Figure 4.2: Principle of introducing SIMD instructions in superscalar and VLIW (EPIC) processors, as a superscalar extension or an EPIC extension
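The "multiple operations within a single instruction" principle can be simulated in software; below is a sketch of an MMX-style wrap-around packed add on four 16-bit lanes held in one 64-bit value (lane count and widths chosen for illustration):

```python
def packed_add16(a, b):
    """Add four 16-bit lanes of two 64-bit packed values,
    with per-lane wrap-around (no carry between lanes)."""
    result = 0
    for lane in range(4):
        shift = 16 * lane
        x = (a >> shift) & 0xFFFF
        y = (b >> shift) & 0xFFFF
        result |= ((x + y) & 0xFFFF) << shift  # wrap within the lane
    return result

a = 0x0001_0002_0003_0004
b = 0x0010_0020_0030_0040
print(hex(packed_add16(a, b)))  # 0x11002200330044
```

A real FX-SIMD instruction performs all four lane additions in one cycle on one EU; the loop here only models the semantics, not the parallel hardware.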
4.2. The appearance of SIMD instructions in superscalars (1)
Intel's and AMD's ISA extensions: MMX, SSE, SSE2, SSE3, 3DNow!, 3DNow! Professional
Figure 4.3: The emergence of FX-SIMD and FP-SIMD instructions in superscalars
The 2.5. and 3. generation superscalars (1)
- 2.5. generation superscalars: second generation superscalars extended with FX SIMD (MM)
- 3. generation superscalars: extended with FX SIMD + FP SIMD (MM+3D)
2.5. and 3. generation superscalars (2)
Figure 4.4: The emergence of 2.5. and 3. generation superscalars
Bottlenecks evoked by third generation superscalars: the system architecture (memory, display), addressed by on-chip L2 caches and the AGP bus.
4.3. Overview of superscalar processor generations
- First generation ("thin" superscalars): 2-3 RISC or 2 CISC instructions/cycle "wide"; unbuffered issue, no renaming, no predecoding, no ROB, static branch prediction; single-ported, blocking L1 data caches (or non-blocking with at most a single pending cache miss allowed), off-chip L2 caches attached via the processor bus; no MM/3D support. Examples: Alpha 21064, PA 7100, PowerPC 601, SuperSparc, Pentium.
- Second generation ("wide" superscalars): 4 RISC or 3 CISC instructions/cycle "wide"; buffered issue (shelving), renaming, predecoding, ROB, dynamic branch prediction; dual-ported, non-blocking L1 data caches with multiple cache misses allowed, off-chip direct-coupled L2 caches. Examples: Alpha 21264, PA 8000, PowerPC 604, PowerPC 620, UltraSparc I/II, Pentium Pro, Pentium II, K6.
- 2.5 generation ("wide" superscalars with MM support): second generation features plus FX-SIMD instructions.
- Third generation ("wide" superscalars with MM/3D support): FX- and FP-SIMD instructions, on-chip L2 caches. Examples: Power 4, Pentium III, Pentium 4, Athlon (model 4), Athlon MP (model 6).
Each step raises performance at the price of growing complexity, memory bandwidth demand and required branch prediction accuracy.
4.4. Exhausting the performance potential of data parallelism
In general purpose applications second generation superscalars already exhaust the parallelism available at the instruction level, whereas third generation superscalars also exhaust the instruction-level parallelism available in dedicated applications (such as MM or 3D applications). Thus the era of ILP-processors came to an end.
4.5. The introduction of SIMD instructions in EPIC (VLIW) processors
VLIW architectures/processors did not support SIMD instructions; EPIC architectures/processors inherently support them (like the IA-64 ISA or the processors of the Itanium family).
5. Summing up the main road of evolution
5.1. Main evolution scenarios
a. Evolutionary scenario (superscalar approach, the main road): introduction and increase of temporal parallelism, then of issue parallelism, then of data parallelism
b. Radical scenario (VLIW/EPIC approach): introduction of VLIW processing, then introduction of data parallelism (EPIC)
5.2. Main road of processor evolution (1)
Figure 5.1: The three cycles of the main road of processor evolution. The extent of operation-level parallelism (and the level of hardware redundancy) grows from traditional von Neumann processors through pipeline processors (temporal parallelism, ~1985/88) and superscalar processors (+ issue parallelism, ~1990/93) to superscalar processors with SIMD extension (+ data parallelism, ~1994/00).
5.2. The main road of evolution (2)
Each of the three main cycles (i = 1..3: introduction of temporal, issue and data parallelism) follows the same pattern:
- introduction of a particular dimension of parallelism
- processing bottleneck(s) arise
- elimination of the bottleneck(s) evoked, by introducing appropriate techniques
- as a consequence, the parallelism available in the given dimension becomes exhausted, and further performance increase is achievable only by introducing a new dimension of parallelism
Figure 5.2: Three main cycles of the main road
5.2. Main road of the evolution (3)
Figure 5.3: New techniques introduced in the three main cycles of processor evolution. Cycle 1 (~1985/88), pipeline processors: caches (1.5. generation), branch prediction (2. generation), advanced memory subsystem and advanced branch processing. Cycle 2 (~1990/93), superscalar processors: dynamic instruction scheduling, renaming, predecoding, dynamic branch prediction, ROB, dual-ported and non-blocking L1 data caches with multiple cache misses allowed, off-chip direct-coupled L2 caches. Cycle 3 (~1994/97), superscalars with SIMD extension: FX SIMD (2.5. generation) and FP SIMD (3. generation) ISA extensions, extension of the system architecture (AGP, on-chip L2).
5.2. Main road of evolution (4)
Figure 5.4: Memory bandwidth and hardware complexity vs. rising processor performance (~1985-2000): raising performance also raises the required memory bandwidth and hardware complexity.
5.2. Main road of evolution (5)
Figure 5.5: Branch prediction accuracy vs. rising clock rates (~1985-2000): as fc and the number of pipeline stages grow, ever higher branch prediction accuracy is needed.
6. Outlook: the introduction of thread level parallelism (1)
Granularity of parallelism: within a thread (instruction flow), ILP (instruction-level parallelism); across multiple threads, TLP (thread-level parallelism).
6. Outlook: the introduction of thread level parallelism (2)
Where can multiple threads come from?
- From different applications: multiprogramming
- From the same application: multitasking, multithreading
6. Outlook: the introduction of thread level parallelism (3)
Implementation of thread level parallelism in microprocessors:
- SMP (Symmetric Multiprocessing) / CMP (Chip Multiprocessing): two or more cores placed on the same chip, each with its own L2/L3, sharing L3/memory
- SMT (Simultaneous Multithreading; Intel: Hyperthreading): a single multithreaded core
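Seen from software, thread level parallelism simply means supplying independent instruction streams; a minimal sketch using a thread pool (how the threads map onto SMT contexts or CMP cores is left to the OS scheduler, and is not modeled here):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    # Each thread works on its own chunk: the four streams are
    # fully independent, which is what SMT/CMP hardware exploits.
    return sum(chunk)

data = list(range(1000))
chunks = [data[i:i + 250] for i in range(0, 1000, 250)]
with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(partial_sum, chunks))
print(total)  # 499500
```

The result is the same as a sequential sum; the point is only that four independent threads exist for the hardware to interleave or run on separate cores.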
6. Outlook: the introduction of thread level parallelism (4)
SMT (Simultaneous Multithreading; Intel: Hyperthreading): a (four-way) superscalar core becomes a multithreaded superscalar (four-way, two threads), filling its issue slots from two threads at once.