Evolution of ILP Processing Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007
Foreword

The steady demand for higher processor performance has provoked the successive introduction of temporal, issue and intra-instruction parallelism into processor operation. Consequently, traditional sequential processors, pipelined processors, superscalar processors and superscalar processors with multimedia and 3D support mark successive evolutionary phases of microprocessors.

On the other hand, the introduction of each basic technique mentioned gave rise to specific system bottlenecks whose resolution called for innovative new techniques. Thus, the emergence of pipelined instruction processing stimulated the introduction of caches and of speculative branch processing. The debut of superscalar instruction issue gave rise to more advanced memory subsystems and more advanced branch processing. The desire to further increase the per-cycle performance of first-generation superscalars called for avoiding their issue bottleneck through the introduction of shelving, renaming and a concerted enhancement of all relevant subsystems of the microarchitecture. Finally, the utilization of intra-instruction parallelism through SIMD instructions required an adequate extension of the ISA and the system architecture.

With the main dimensions of parallelism more or less exhausted in second-generation superscalars for general purpose applications, increasing the clock frequency remained the single major possibility to increase performance further. The rapid increase of clock frequencies, however, led to limits of evolution, as discussed in Chapter II.
Structure
1. Paradigms of ILP-processing
2. Introduction of temporal parallelism
3. Introduction of issue parallelism
 3.1. VLIW processing
 3.2. Superscalar processing
4. Introduction of data parallelism
5. The main road of evolution
6. Outlook
1. Paradigms of ILP-processing
1.1. Introduction (1)

Figure 1.1: Evolution of computer classes (1950-2000): mainframes (UNIVAC, /360, /370, /390, z/900), minicomputers (PDP-8, PDP-11, VAX), supercomputers (ENIAC, NORC, CDC-6600, Cray-1, Cray-2, Cray-3, Cray-4, Cray T3E), and microcomputers from the 4004 and 8080/8088 (Altair) through the 80286, 80386, 80486, Pentium, PPro, PII, PIII and P4, branching into the server/workstation (RS/6000, Xeon), desktop PC and value PC (Celeron) classes.
1.1. Introduction (2)

Figure 1.2: The integer performance of Intel’s x86 line of processors
1.2. Paradigms of ILP-processing (1)

Paradigms of ILP-processing:
- Temporal parallelism → pipeline processors
- Issue parallelism with static dependency resolution → VLIW processors
VLIW processing: the processor is fed VLIW (Very Long Instruction Word) instructions whose sub-instructions are guaranteed to be independent (static dependency resolution, performed by the compiler).
1.2. Paradigms of ILP processing (1)

Paradigms of ILP processing:
- Temporal parallelism → pipeline processors
- Issue parallelism
  - Static dependency resolution → VLIW processors
  - Dynamic dependency resolution → superscalar processors
VLIW processing vs. superscalar processing: a VLIW (Very Long Instruction Word) processor receives independent instructions (static dependency resolution), whereas a superscalar processor accepts dependent instructions and resolves the dependencies dynamically in hardware.
1.2. Paradigms of ILP processing (1)

Paradigms of ILP processing:
- Temporal parallelism → pipeline processors
- Issue parallelism
  - Static dependency resolution → VLIW processors
  - Dynamic dependency resolution → superscalar processors
- Data parallelism → SIMD extension
1.2. Paradigms of ILP-processing (2)

Figure 1.3: The emergence of ILP-paradigms and processor types: sequential processing, then temporal parallelism (pipeline processors, ~’85), issue parallelism with dynamic dependency resolution (superscalar processors, ~’90) or static dependency resolution (VLIW processors, later EPIC processors), and data parallelism (superscalar processors with SIMD extension, ~’95-’00).
1.3. Performance potential of ILP-processors (1)

Figure: Absolute performance (ideal vs. real case) rising with each step from sequential processing through pipelining and VLIW/superscalar execution to the SIMD extension.
1.3. Performance potential of ILP-processors (2)

Performance components of ILP-processors: absolute performance is the product of the clock frequency and the per-cycle efficiency, with:
- clock frequency: depends on the technology and the microarchitecture
- per-cycle efficiency: depends on the ISA, microarchitecture, system architecture, OS, compiler and application; it is determined by the temporal, issue and data parallelism exploited and by the efficiency of speculative execution
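The two performance components above multiply, which a few lines of Python make concrete. This is an illustrative sketch, not data from the slides; all numeric values below are made-up example figures.

```python
# Illustrative sketch: absolute performance as the product of clock frequency
# and per-cycle efficiency (instructions executed per cycle).
# All numbers are hypothetical example values.

def performance(f_clock_hz, instr_per_cycle):
    """Instructions per second = clock frequency * per-cycle efficiency."""
    return f_clock_hz * instr_per_cycle

base   = performance(100e6, 1.0)   # scalar pipeline, ~1 instruction/cycle
wide   = performance(100e6, 2.5)   # issue parallelism raises the per-cycle efficiency
faster = performance(300e6, 2.5)   # plus a technology-driven clock increase

print(faster / base)   # 7.5 -> the two factors multiply
```

Either factor alone caps the achievable speedup, which is why the lecture treats clock frequency and per-cycle efficiency as separate evolution tracks.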
2. Introduction of temporal parallelism
2.1. Introduction (1)

Types of temporal parallelism in ILP processors: overlapping all phases of instruction processing (F: fetch cycle, D: decode cycle, E: execute cycle, W: write cycle), so that instruction i+1 enters the pipeline while instruction i advances to its next stage.

Figure 2.1: Implementation alternatives of temporal parallelism. Pipeline processors: Atlas (1963), IBM 360/91 (1967), i80386 (1985), R2000 (1988), M68030 (1988).
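The gain from overlapping the F-D-E-W phases can be sketched in a few lines. This assumes an ideal 4-stage pipeline with no stalls (a simplification; the branch and memory bottlenecks discussed below break this ideal).

```python
# Timing sketch of an ideal 4-stage (F, D, E, W) pipeline, no stalls assumed.

STAGES = 4  # F, D, E, W

def sequential_cycles(n):
    """Without overlap, each instruction occupies all stages in turn."""
    return n * STAGES

def pipelined_cycles(n):
    """With overlap, the pipe fills once, then one instruction completes per cycle."""
    return STAGES + (n - 1)

n = 100
print(sequential_cycles(n))  # 400
print(pipelined_cycles(n))   # 103 -> speedup approaches STAGES for large n
```

For long instruction streams the speedup approaches the stage count, which is exactly the temporal parallelism this chapter introduces.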
2.1. Introduction (2)

Figure 2.2: The appearance of pipeline (scalar) processors (1980-92): in the x86 line the 80286, 80386 and 80486; in the M68000 line the 68020, 68030 and 68040; in the MIPS R line the R2000, R3000, R6000 and R4000.
2.2. Processing bottlenecks evoked and their resolution

2.2.1. Overview
- The scarcity of memory bandwidth (2.2.2)
- The problem of branch processing (2.2.3)
2.2.2. The scarcity of memory bandwidth (1)

Moving from sequential to pipeline processing, more instructions and data need to be fetched per cycle, which calls for a larger memory bandwidth.
2.2.2. The scarcity of memory bandwidth (2)

Figure 2.3: Introduction of caches (1980-92). Early pipeline (scalar) processors (80286, 68020, R2000) still had no caches; later ones (80386, 80486, 68030, 68040, R3000, R6000, R4000) introduced them, with configurations such as C(8), C(16), C(0,1/4), C(1/4,1/4), C(4,4) and C(8,8). Notation: C(n) is a universal cache of n kB; C(n,m) are split instruction/data caches of n and m kB.
2.2.3. The problem of branch processing (1)

(E.g. in case of conditional branches.)

Figure 2.4: Processing of a conditional branch bc on a 4-stage (F-D-E-W) pipeline: the branch is recognized during decode (D), while the branch address calculation and condition checking happen during execute (E), so the branch target instruction (bti) can only be fetched afterwards; the instructions i+1, i+2 fetched in the meantime must be discarded, wasting clock cycles.
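The cost of these discarded cycles is easy to estimate. The calculation below uses the ~20% branch frequency quoted later in Section 3.3.2; the taken-ratio is an assumed example value, not a figure from the slides.

```python
# Back-of-the-envelope branch penalty on the 4-stage pipeline of Figure 2.4.
# Example values assumed: the condition is resolved in E, so roughly the two
# wrong-path fetch/decode cycles are wasted per taken branch.

branch_freq   = 0.20   # share of branches in the dynamic instruction mix
taken_ratio   = 0.6    # assumed fraction of branches actually taken
bubble_cycles = 2      # F and D of the wrong-path instructions are discarded

cpi = 1 + branch_freq * taken_ratio * bubble_cycles
print(round(cpi, 2))   # 1.24 -> roughly a quarter of the ideal throughput lost
```

Even with these modest assumptions the effective CPI degrades noticeably, which motivates the branch prediction introduced next.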
2.2.3. The problem of branch processing (2)

Figure 2.5: Principle of branch prediction in case of a conditional branch: at each conditional branch terminating a basic block the processor guesses which path will be taken and continues fetching along the guessed path until the branch is resolved and the path is approved (instructions other than conditional branches proceed normally).
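One classic way to produce the "guessed path" of Figure 2.5 is a 2-bit saturating counter per branch. The slide does not prescribe a particular predictor, so the following is one common dynamic scheme shown for illustration.

```python
# Classic 2-bit saturating-counter branch predictor (one common dynamic
# scheme; shown as an illustration of the prediction principle).

class TwoBitPredictor:
    def __init__(self):
        self.state = 0            # 0/1: predict not-taken, 2/3: predict taken

    def predict(self):
        return self.state >= 2    # True -> guess the taken path

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True] * 8 + [False] + [True] * 8   # a loop branch with one exit
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)
print(correct, len(outcomes))   # 14 17 -> the single loop exit costs one miss
```

The hysteresis of the two-bit state means one anomalous outcome (the loop exit) does not flip the prediction, which is why such counters predict loop branches well.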
2.2.3. The problem of branch processing (3)

Figure 2.6: Introduction of branch prediction (speculative execution of branches) in (scalar) pipeline processors (1980-92, across the x86, M68000 and MIPS R lines).
2.3. Generations of pipeline processors (1)

- 1. generation pipelined: no cache, no speculative branch processing
- 1.5. generation pipelined: cache, no speculative branch processing
- 2. generation pipelined: cache and speculative branch processing
2.3. Generations of pipeline processors (2)

Figure 2.7: Generations of pipeline processors (1980-92, x86, M68000 and MIPS R lines): 1. generation (no cache, no speculative branch processing), 1.5. generation (cache, no speculative branch processing) and 2. generation (cache and speculative branch processing).
2.4. Exhausting the available temporal parallelism

2. generation pipeline processors already exhaust the available temporal parallelism.
3. Introduction of issue parallelism
3.1. Options to implement issue parallelism

On top of pipeline processing, issue parallelism can be implemented as VLIW (EPIC) instruction issue with static dependency resolution (3.2) or as superscalar instruction issue with dynamic dependency resolution (3.3).
3.2. VLIW processing (1)

Figure 3.1: Principle of VLIW processing: the VLIW processor fetches from memory/cache VLIW instructions whose sub-instructions are independent (static dependency resolution) and feeds them directly to its execution units (EUs), of which there may be on the order of 10-30.
3.2. VLIW processing (2)

VLIW: Very Long Instruction Word (term coined in 1983 by Fisher). Sub-instructions are ~32 bits long, so the instruction length is ~n*32 bits, where n is the number of execution units (EUs). A complex VLIW compiler performs static dependency resolution together with parallel optimization.
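The core of static dependency resolution is a compile-time independence check. The sketch below illustrates the idea only (a real VLIW compiler also schedules around EU and cache latencies, as noted later): sub-instructions may share one VLIW word only if none reads or writes a register another one writes.

```python
# Compile-time independence check for candidate VLIW slots (illustrative).

def independent(slots):
    """slots: list of (dest_reg, [src_regs]) sub-instructions."""
    for i, (d1, srcs1) in enumerate(slots):
        for d2, srcs2 in slots[i + 1:]:
            # Reject write-write, read-after-write and write-after-read pairs.
            if d1 == d2 or d1 in srcs2 or d2 in srcs1:
                return False
    return True

# r1 = r2+r3 and r4 = r5*r6 may share a word; r7 = r1-r8 may not,
# because it reads r1, which the first slot writes.
print(independent([("r1", ["r2", "r3"]), ("r4", ["r5", "r6"])]))  # True
print(independent([("r1", ["r2", "r3"]), ("r7", ["r1", "r8"])]))  # False
```

Slots that fail the check must go into a later VLIW word, which is one source of the partially filled instructions criticized below.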
3.2. VLIW processing (3)

Figure 3.2: Experimental and commercially available VLIW processors (with the appearance of the term ‘VLIW’ marked). Source: Sima et al., ACA, Addison-Wesley, 1997
3.2. VLIW processing (4)

Benefits of static dependency resolution:
- earlier appearance
- either a higher f_c or larger ILP
- less complex processors
3.2. VLIW processing (5)

Drawbacks of static dependency resolution:
- The compiler uses technology-dependent parameters (e.g. latencies of EUs and caches, repetition rates of EUs) for dependency resolution and parallel optimization, so new processor models require new compiler versions.
- A completely new ISA requires new compilers and OSs, the rewriting of applications, and achieving the critical mass to convince the market.
3.2. VLIW processing (6)

Drawbacks of static dependency resolution (cont.): VLIW instructions are only partially filled, so memory space and bandwidth are poorly utilized.
3.2. VLIW processing (7)

Commercial VLIW processors: Trace (1987, Multiflow) and Cydra-5 (1989, Cydrome). Within a few years both firms became bankrupt; their developers went to HP and IBM and became initiators/developers of EPIC processors.
3.2. VLIW processing (8)

VLIW → EPIC: integration of SIMD instructions and advanced superscalar features. 1994: Intel and HP announced their cooperation; 1997: the term EPIC was born; 2001: the IA-64 Itanium.
3.3. Superscalar processing

3.3.1. Introduction (1)

Main attributes of superscalar processing: pipeline processing, superscalar instruction issue, dynamic dependency resolution and a compatible ISA.
3.3.1. Introduction (2)

Figure 3.3: Experimental superscalar processors. Source: Sima et al., ACA, Addison-Wesley, 1997
3.3.1. Introduction (3) Figure 3.4: Emergence of superscalar processors Source: Sima et al., ACA, Addison-Wesley, 1997
3.3.2. Attributes of first generation superscalars (1)

Core: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"; static branch prediction.
Cache: single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus.
Examples: Pentium, PA 7100, Alpha 21064.
3.3.2. Attributes of first generation superscalars (2)

Consistency of processor features (1)

Dynamic instruction frequencies in general purpose applications (Wall 1989; Lam, Wilson 1992):
- FX instructions ~40%
- Load instructions ~30%
- Store instructions ~10%
- Branches ~20%
- FP instructions ~1-5%

Available parallelism in general purpose applications assuming direct issue: ~2 instructions/cycle. Source: Sima et al., ACA, Addison-Wesley, 1997
3.3.2. Attributes of first generation superscalars (3)

Consistency of processor features (2)

Reasonable core width: 2-3 instructions/cycle.

Required EUs (each L/S instruction generates an address calculation as well):
- FX: ~0.8 * (2-3) = 1.6-2.4 → 2-3 FX EUs
- L/S: ~0.4 * (2-3) = 0.8-1.2 → 1 L/S EU
- Branch: ~0.2 * (2-3) = 0.4-0.6 → 1 B EU
- FP: ~(0.01-0.05) * (2-3) → 1 FP EU

Required number of data cache ports: n_p ~ 0.4 * (2-3) = 0.8-1.2 → single-port data caches.
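This consistency calculation can be redone in a few lines for any core width. The instruction mix is the one quoted above; since each L/S instruction also generates an FX address calculation, the FX share rises to 0.4 + 0.4 = 0.8.

```python
# The slides' EU consistency calculation, parameterized by core width.
# Mix: 40% FX + 40% L/S address calculations -> 0.8 FX; 30% load + 10% store
# -> 0.4 L/S; 20% branches -> 0.2.

MIX = {"FX": 0.8, "L/S": 0.4, "Branch": 0.2}

def required_eus(width):
    """width: (low, high) issue width in instructions/cycle."""
    return {unit: (share * width[0], share * width[1])
            for unit, share in MIX.items()}

for width in ((2, 3), (4, 5)):   # 1st- and 2nd-generation core widths
    print(width, required_eus(width))

# width (2, 3): FX 1.6-2.4 -> 2-3 FX EUs; L/S 0.8-1.2 -> 1 L/S EU, 1 cache port
# width (4, 5): FX 3.2-4.0 -> 3-4 FX EUs; L/S 1.6-2.0 -> 2 L/S EUs, 2 cache ports
```

Running it with width (4, 5) reproduces the second-generation figures of Section 3.3.4, which shows that both generations' EU counts follow from the same instruction mix.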
3.3.3. The bottleneck evoked and its resolution (1)

Figure 3.5: The principle of direct issue: (a) the issue process; (b) the issue bottleneck, shown on a simplified structure of the microarchitecture assuming direct issue.
3.3.3. The bottleneck evoked and its resolution (2)

Eliminating the issue bottleneck. Figure 3.6: Principle of buffered (out-of-order) issue.
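The principle behind Figure 3.6 can be modeled in miniature. The following is a sketch of the idea only (register names and readiness are invented for illustration): decoded instructions wait in a buffer, and any instruction whose sources are ready may dispatch, so one stalled instruction no longer blocks the ones behind it.

```python
# Toy model of buffered (out-of-order) issue: instructions dispatch from a
# buffer as soon as their source registers are ready.

ready_regs = {"r2", "r3", "r5"}        # assumed register-file state

buffer = [                             # (instruction, source registers)
    ("i1", {"r1"}),                    # waits for r1 (e.g. a pending cache miss)
    ("i2", {"r2", "r3"}),              # sources ready
    ("i3", {"r5"}),                    # sources ready
]

dispatched = [name for name, srcs in buffer if srcs <= ready_regs]
waiting    = [name for name, srcs in buffer if not srcs <= ready_regs]
print(dispatched, waiting)   # ['i2', 'i3'] ['i1'] -> i1 no longer blocks i2, i3
```

With direct (in-order) issue, i1 would have stalled the whole issue stage; the buffer decouples issue from dependency resolution, which is what removes the bottleneck.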
3.3.3. The bottleneck evoked and its resolution (3)

First generation (narrow) superscalars → second generation (wide) superscalars: elimination of the issue bottleneck and, in addition, widening the processing width of all subsystems of the core.
3.3.4. Attributes of second generation superscalars (1)

First generation "narrow" superscalars:
- Core: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"; static branch prediction
- Caches: single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus
- Examples: Pentium, PA 7100, Alpha 21064

Second generation "wide" superscalars:
- Core: 4 RISC instructions/cycle or 3 CISC instructions/cycle "wide"; buffered (ooo) issue, predecoding, dynamic branch prediction, register renaming, ROB
- Caches: dual-ported, non-blocking L1 data caches; directly attached off-chip L2 caches
- Examples: Pentium Pro, K6, PA 8000, Alpha 21264
3.3.4. Attributes of second generation superscalars (2)

Consistency of processor features (1)

Dynamic instruction frequencies in general purpose applications (Wall 1990):
- FX instructions ~40%
- Load instructions ~30%
- Store instructions ~10%
- Branches ~20%
- FP instructions ~1-5%

Available parallelism in general purpose applications assuming buffered issue: ~4-6 instructions/cycle. Source: Sima et al., ACA, Addison-Wesley, 1997
Source: Wall: Limits of ILP, WRL TN-15, Dec. 1990 Figure 3.7: Extent of parallelism available in general purpose applications assuming buffered issue
3.3.4. Attributes of second generation superscalars (3)

Consistency of processor features (2)

Reasonable core width: 4-5 instructions/cycle.

Required EUs (each L/S instruction generates an address calculation as well):
- FX: ~0.8 * (4-5) = 3.2-4 → 3-4 FX EUs
- L/S: ~0.4 * (4-5) = 1.6-2 → 2 L/S EUs
- Branch: ~0.2 * (4-5) = 0.8-1 → 1 B EU
- FP: ~(0.01-0.05) * (4-5) → 1 FP EU

Required number of data cache ports: n_p ~ 0.4 * (4-5) = 1.6-2 → dual-port data caches.
3.3.5. Exhausting the issue parallelism

In general purpose applications, 2. generation ("wide") superscalars already exhaust the parallelism available at the instruction level.
4. Introduction of data parallelism
4.1. Overview (1) Figure 4.1: Implementation alternatives of data parallelism
4.1. Overview (2)

Figure 4.2: Principle of introducing SIMD instructions (FX/FP) in superscalar and VLIW (EPIC) processors: multiple operations within a single instruction, implemented either as a superscalar extension or as an EPIC extension of superscalar issue.
4.2. The appearance of SIMD instructions in superscalars (1)

Figure 4.3: The emergence of FX-SIMD and FP-SIMD instructions in superscalars: Intel’s and AMD’s ISA extensions (MMX, SSE, SSE2, SSE3, 3DNow!, 3DNow! Professional).
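What an FX-SIMD (MMX-style) instruction does can be emulated in ordinary Python: one operation applied to several packed fixed-point elements at once. The emulation below is illustrative only; a real SIMD unit performs the four lane additions in a single hardware instruction.

```python
# Emulation of a packed 4x16-bit add (MMX-style FX-SIMD, wrap-around mode).

MASK = 0xFFFF   # one 16-bit lane

def packed_add16(a, b):
    """Lane-wise add of two 4x16-bit values packed into 64-bit integers."""
    result = 0
    for lane in range(4):
        la = (a >> (16 * lane)) & MASK
        lb = (b >> (16 * lane)) & MASK
        result |= ((la + lb) & MASK) << (16 * lane)  # lanes stay independent
    return result

x = (4 << 48) | (3 << 32) | (2 << 16) | 1        # lanes (1, 2, 3, 4)
y = (40 << 48) | (30 << 32) | (20 << 16) | 10    # lanes (10, 20, 30, 40)
z = packed_add16(x, y)
print([(z >> (16 * i)) & MASK for i in range(4)])  # [11, 22, 33, 44]
```

Masking each lane before and after the add keeps carries from spilling between lanes, which is precisely the data parallelism the SIMD extension exposes.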
2.5. and 3. generation superscalars (1)

Second generation superscalars extended with FX SIMD (MM) instructions form the 2.5. generation superscalars; adding FX SIMD + FP SIMD (MM+3D) yields the 3. generation superscalars.
2.5. and 3. generation superscalars (2) Figure 4.4: The emergence of 2.5. and 3. generation superscalars
Bottlenecks evoked by third generation superscalars: the system architecture (memory, display), addressed by the AGP bus and the on-chip L2.
4.3. Overview of superscalar processor generations

Figure: Performance vs. complexity, memory bandwidth and branch prediction accuracy across the superscalar generations.
4.4. Exhausting the performance potential of data parallelism

In general purpose applications second generation superscalars already exhaust the parallelism available at the instruction level, whereas third generation superscalars also exhaust the parallelism available at the instruction level in dedicated applications (such as MM or 3D applications). Thus the era of ILP-processors came to an end.
4.5. The introduction of SIMD instructions in EPIC (VLIW) processors

VLIW architectures/processors did not support SIMD instructions, whereas EPIC architectures/processors inherently support them (like the IA-64 ISA or the processors of the Itanium family).
5. Summing up the main road of evolution
5.1. Main evolution scenarios

a. Evolutionary scenario (superscalar approach, the main road): introduction and increase of temporal parallelism, then of issue parallelism, then of data parallelism.

b. Radical scenario (VLIW/EPIC approach): introduction of VLIW processing, then introduction of data parallelism (EPIC).
5.2. Main road of processor evolution (1)

Figure 5.1: The three cycles of the main road of processor evolution: from traditional von Neumann processors (sequential processing) through pipeline processors (temporal parallelism, ~1985/88) and superscalar processors (+ issue parallelism, ~1990/93) to superscalar processors with SIMD extension (+ data parallelism, ~1994/00), with both the extent of operation-level parallelism and the level of hardware redundancy rising in each cycle.
5.2. The main road of evolution (2)

Figure 5.2: Three main cycles (i = 1:3) of the main road, one each for the introduction of temporal, issue and data parallelism:
1. introduction of a particular dimension of parallelism;
2. processing bottleneck(s) arise;
3. elimination of the bottleneck(s) evoked by introducing appropriate techniques;
4. as a consequence, the parallelism available in the given dimension becomes exhausted, and further performance increase is achievable only by introducing a new dimension of parallelism.
5.2. Main road of the evolution (3)

Figure 5.3: New techniques introduced in the three main cycles of processor evolution.
- Introduction of temporal parallelism (~1985/88): from traditional sequential processors to pipeline processors; 1. generation, then 1.5. generation (caches), then 2. generation (branch prediction).
- Introduction of issue parallelism (~1990/93): superscalar processors; 1. generation, then 2. generation with an advanced memory subsystem and advanced branch processing: dynamic instruction scheduling, renaming, predecoding, dynamic branch prediction, ROB, dual-ported data caches, non-blocking L1 data caches with multiple cache misses allowed, off-chip direct-coupled L2 caches.
- Introduction of data parallelism (~1994/97): superscalars with ISA (SIMD) extension; 2.5. generation with the FX SIMD extension and an extended system architecture (AGP, on-chip L2, ...), then 3. generation with the FP SIMD extension.
5.2. Main road of evolution (4)

Figure 5.4: Memory bandwidth and hardware complexity vs. rising processor performance (~1985 to ~2000).
5.2. Main road of evolution (5)

Figure 5.5: Branch prediction accuracy and the number of pipeline stages vs. rising clock rates f_c (~1985 to ~2000).
6. Outlook: introduction of thread level parallelism
6. Outlook: the introduction of thread level parallelism (1)

From ILP (instruction-level parallelism) to TP (thread-level parallelism): a thread is an instruction flow, and running multiple threads raises the granularity of parallelism.
6. Outlook: the introduction of thread level parallelism (2)

Where can multiple threads come from? From the same application (multithreading) or from different applications (multiprogramming, multitasking).
6. Outlook: the introduction of thread level parallelism (3)

Basic implementation alternatives of thread level parallelism:
- two or more cores placed on the same chip, each with its own L2/L3, sharing the L3/memory: CMP - Chip Multiprocessing (SMP: Symmetric Multiprocessing);
- a multithreaded core with its L2/L3 and L3/memory: SMT - Simultaneous Multithreading (HT: Hyperthreading, Intel).
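From the software side, both alternatives consume the same kind of workload: independent threads. The minimal sketch below (hypothetical worker and data, for illustration) shows two threads of one application; whether they run on separate cores (CMP) or share one SMT core is decided by the hardware and the OS, not by this code.

```python
# Minimal thread-level parallelism from one application: two threads,
# each summing its own slice of the data.

import threading

results = {}

def worker(name, data):
    results[name] = sum(data)   # each thread processes an independent slice

t1 = threading.Thread(target=worker, args=("thread1", range(0, 50)))
t2 = threading.Thread(target=worker, args=("thread2", range(50, 100)))
t1.start(); t2.start()
t1.join(); t2.join()
print(results["thread1"] + results["thread2"])   # 4950
```

Exposing such independent instruction flows is exactly what lets CMP and SMT designs keep growing performance after the ILP dimensions are exhausted.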
6. Outlook: the introduction of thread level parallelism (4)

Figure: A (four-way) superscalar executing a single thread vs. a multithreaded superscalar (four-way / two threads) whose issue slots are filled from Thread 1 and Thread 2. SMT: Simultaneous Multithreading (HT: Hyperthreading, Intel).