1. Evolution of ILP-processing Dezső Sima Fall 2006 D. Sima, 2006
Structure 1. Paradigms of ILP-processing 2. Introduction of temporal parallelism 3. Introduction of issue parallelism 3.1. VLIW processing 3.2. Superscalar processing 4. Introduction of data parallelism 5. The main road of evolution 6. Outlook
1. Paradigms of ILP-processing 1.1. Introduction (1) Figure 1.1: Evolution of computer classes, 1950-2000 — mainframes (UNIVAC, /360, /370, /390, z/900), minicomputers (PDP-8, PDP-11, VAX), microcomputers/PCs (Altair, 4004, 8080, 8088, 80286, 80386, 80486, Pentium, PPro, PII, PIII, P4, Celeron), servers/workstations (RS/6000, Xeon), supercomputers (ENIAC, NORC, CDC-6600, Cray-1 to Cray-4, Cray T3E)
1.1. Introduction (2) Figure 1.2: The integer performance of Intel’s x86 line of processors
1.2. Paradigms of ILP-processing (1) Temporal parallelism Issue parallelism Static dependency resolution Pipeline processors VLIW processors
VLIW processing VLIW: Very Long Instruction Word Independent instructions (static dependency resolution) → Processor (F, E)
1.2. Paradigms of ILP processing (1) Temporal parallelism Issue parallelism Static dependency resolution Dynamic dependency resolution Pipeline processors VLIW processors Superscalar processors
VLIW processing: independent instructions (static dependency resolution) → Processor (F, E) Superscalar processing: dependent instructions, dynamic dependency resolution → Processor (F, E)
1.2. Paradigms of ILP processing (1) Temporal parallelism Issue parallelism Data parallelism Static dependency resolution Dynamic dependency resolution Pipeline processors VLIW processors Superscalar processors SIMD extension
1.2. Paradigms of ILP-processing (2) Sequential processing → temporal parallelism (pipeline processors, ~’85) → issue parallelism (static dependency resolution: VLIW/EPIC processors; dynamic dependency resolution: superscalar processors, ~’90) → data parallelism (superscalar processors with SIMD extension, ~’95-’00) Figure 1.3: The emergence of ILP-paradigms and processor types
1.3. Performance potential of ILP-processors (1) Absolute performance (ideal vs. real case) grows with each paradigm: sequential → pipeline → VLIW/superscalar → SIMD extension
1.3. Performance potential of ILP-processors (2) Performance components of ILP-processors: Clock frequency — depends on technology/μarchitecture Per-cycle efficiency (temporal parallelism, issue parallelism, data parallelism, efficiency of speculative execution) — depends on ISA, μarchitecture, system architecture, OS, compiler, application
2. Introduction of temporal parallelism 2.1. Introduction (1) Types of temporal parallelism in ILP processors: overlapping all phases of successive instructions i, i+1, i+2, i+3 (F: fetch cycle, D: decode cycle, E: execute cycle, W: write cycle) — pipeline processors, e.g. Atlas (1963), IBM 360/91 (1967), i80386 (1985), M68030 (1988), R2000 (1988) Figure 2.1: Implementation alternatives of temporal parallelism
2.1. Introduction (2) Figure 2.2: The appearance of pipeline (scalar) processors, 1980-1992 — x86 (80286, 80386, 80486), M68000 (68020, 68030, 68040), MIPS R (R2000, R3000, R6000, R4000)
2.2. Processing bottlenecks evoked and their resolution 2.2.1. Overview The scarcity of memory bandwidth (2.2.2) The problem of branch processing (2.2.3)
2.2.2. The scarcity of memory bandwidth (1) Sequential processing Pipeline processing More instructions and data need to be fetched per cycle Larger memory bandwidth
2.2.2. The scarcity of memory bandwidth (2) Figure 2.3: Introduction of caches, 1980-1992 — pipeline (scalar) processors without cache(s) (80286, 68020, R2000) vs. with cache(s); C(n): universal cache (size in kB), C(n,m): instruction/data caches (sizes in kB), e.g. C(0,1/4), C(1/4,1/4), C(4,4), C(8), C(16), C(8,8)
2.2.3. The problem of branch processing (1) E.g. in case of a conditional branch bc (instruction ii): condition checking and branch address calculation happen in the E stage, so the sequentially fetched instructions ii+1 and ii+2 must be discarded if the branch is taken, and the branch target instruction (bti) can only be fetched afterwards — wasted cycles. Figure 2.4: Processing of a conditional branch on a 4-stage pipeline
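The penalty sketched in Figure 2.4 can be estimated with a toy cycle-count model (my own illustration, not from the slides): on a 4-stage F/D/E/W pipeline the condition is resolved at the end of E, so each taken branch wastes the slots of the two instructions fetched behind it.

```python
def cycles_with_branch(n_instr, n_taken_branches, resolve_stage=3):
    """Cycles to run n_instr instructions on a 4-stage (F, D, E, W) pipeline.

    Each taken branch flushes the instructions fetched behind it, costing
    resolve_stage - 1 bubble cycles (the condition is known at the end of
    stage 3, i.e. E).
    """
    fill = 4 - 1                  # cycles to fill the pipeline initially
    penalty = resolve_stage - 1   # bubbles per taken branch
    return fill + n_instr + n_taken_branches * penalty

# 100 instructions with no taken branches: 103 cycles (CPI approaches 1);
# with 20 taken branches: 40 extra bubble cycles, i.e. 143 cycles.
```

This makes the motivation for branch prediction on the next slide quantitative: with ~20% branches, an unpredicted pipeline loses a large fraction of its temporal parallelism.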
2.2.3. The problem of branch processing (2) Conditional branches delimit basic blocks; the processor follows a guessed path, which is later either approved or discarded. Figure 2.5: Principle of branch prediction in case of a conditional branch
2.2.3. The problem of branch processing (3) Figure 2.6: Introduction of branch prediction (speculative execution of branches) in (scalar) pipeline processors, 1980-1992 — x86 (80286, 80386, 80486), M68000 (68020, 68030, 68040), MIPS R (R2000, R3000, R6000, R4000)
2.3. Generations of pipeline processors (1)
1. generation pipelined: no cache, no speculative branch processing
1.5. generation pipelined: cache, no speculative branch processing
2. generation pipelined: cache, speculative branch processing
2.3. Generations of pipeline processors (2) Figure 2.7: Generations of pipeline processors, 1980-1992 — 1. generation (no cache, no speculative branch processing), 1.5. generation (cache, no speculative branch processing), 2. generation (cache, speculative branch processing); x86 (80286, 80386, 80486), M68000 (68020, 68030, 68040), MIPS R (R2000, R3000, R6000, R4000)
2.4. Exhausting the available temporal parallelism 2. generation pipeline processors already exhaust the available temporal parallelism
3. Introduction of issue parallelism 3.1. Options to implement issue parallelism
Pipeline processing → VLIW (EPIC) instruction issue: static dependency resolution (3.2)
Pipeline processing → superscalar instruction issue: dynamic dependency resolution (3.3)
3.2. VLIW processing (1) Memory/cache → VLIW instructions with independent sub-instructions (static dependency resolution) → VLIW processor (~10-30 EUs) Figure 3.1: Principle of VLIW processing
3.2. VLIW processing (2) VLIW: Very Long Instruction Word (term coined by Fisher, 1983) Length of sub-instructions: ~32 bit; instruction length: ~n*32 bit (n: number of execution units, EUs) Static dependency resolution with parallel optimization → complex VLIW compiler
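Static dependency resolution means the compiler may only pack sub-instructions into one VLIW word if they are mutually independent. A hypothetical check (the register-triple instruction format is my illustration, not an actual VLIW encoding):

```python
def independent(bundle):
    """Check that sub-instructions may share one VLIW word.

    Each sub-instruction is (dest, src1, src2), registers as strings.
    Within a bundle, no sub-instruction's destination may be read or
    written by any other sub-instruction (no RAW/WAR/WAW hazards inside
    the word).
    """
    for i, (d1, *_rest) in enumerate(bundle):
        for j, (d2, s1, s2) in enumerate(bundle):
            if i != j and d1 in (d2, s1, s2):
                return False
    return True

# independent([("r1","r2","r3"), ("r4","r5","r6")])  -> True
# independent([("r1","r2","r3"), ("r4","r1","r6")])  -> False (hazard on r1)
```

The VLIW compiler performs this kind of analysis (plus scheduling around EU and cache latencies) at compile time, which is exactly why the processor itself can stay simple.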
3.2. VLIW processing (3) The term ‘VLIW’ Figure 3.2: Experimental and commercially available VLIW processors Source: Sima et al., ACA, Addison-Wesley, 1997
3.2. VLIW processing (4) Benefits of static dependency resolution: less complex processors, earlier appearance, either higher fc or larger ILP
3.2. VLIW processing (5) Drawbacks of static dependency resolution: Completely new ISA → new compilers and OS, rewriting of applications, achieving the critical mass to convince the market The compiler uses technology-dependent parameters (e.g. latencies of EUs and caches, repetition rates of EUs) for dependency resolution and parallel optimization → new processor models require new compiler versions
3.2. VLIW processing (6) Drawbacks of static dependency resolution (cont.): VLIW instructions are only partially filled → poorly utilized memory space and bandwidth
3.2. VLIW processing (7) Commercial VLIW processors: Trace (1987, Multiflow), Cydra-5 (1989, Cydrome) Within a few years both firms went bankrupt; their developers moved to HP and IBM, where they became initiators/developers of EPIC processors
3.2. VLIW processing (8) VLIW EPIC Integration of SIMD instructions and advanced superscalar features 1994: Intel, HP announced the cooperation 1997: The EPIC term was born 2001: IA-64 Itanium
3.3. Superscalar processing 3.3.1. Introduction (1) Pipeline processing Superscalar instruction issue Main attributes of superscalar processing: Dynamic dependency resolution Compatible ISA
3.3.1. Introduction (2) Figure 3.3: Experimental superscalar processors Source: Sima et al., ACA, Addison-Wesley, 1997
3.3.1. Introduction (3) Figure 3.4: Emergence of superscalar processors Source: Sima et al., ACA, Addison-Wesley, 1997
3.3.2. Attributes of first generation superscalars (1)
Width: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle „wide”
Core: static branch prediction
Cache: single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus
Examples: Alpha 21064, PA 7100, Pentium
3.3.2. Attributes of first generation superscalars (2) Consistency of processor features (1)
Dynamic instruction frequencies in general purpose applications: FX instructions ~40 %, load instructions ~30 %, store instructions ~10 %, branches ~20 %, FP instructions ~1-5 %
Available parallelism in general purpose applications assuming direct issue: ~2 instructions/cycle (Wall 1989; Lam, Wilson 1992)
Source: Sima et al., ACA, Addison-Wesley, 1997
3.3.2. Attributes of first generation superscalars (3) Consistency of processor features (2)
Reasonable core width: 2-3 instructions/cycle
Required number of data cache ports (np): np ~ 0.4 * (2-3) = 0.8-1.2 accesses/cycle → single-port data caches
Required EUs (each L/S instruction generates an address calculation as well):
FX: ~0.8 * (2-3) = 1.6-2.4 → 2-3 FX EUs
L/S: ~0.4 * (2-3) = 0.8-1.2 → 1 L/S EU
Branch: ~0.2 * (2-3) = 0.4-0.6 → 1 B EU
FP: ~(0.01-0.05) * (2-3) → 1 FP EU
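The sizing argument above is simple arithmetic over the dynamic instruction mix; a parameterized sketch (my own restatement of the slide's calculation) reproduces the numbers for any issue width, so the same function also covers the second-generation case later on:

```python
import math

# Dynamic instruction frequencies in general purpose applications
# (from the preceding slide).
MIX = {"fx": 0.40, "load": 0.30, "store": 0.10, "branch": 0.20}

def required_units(width):
    """Execution units and cache ports needed at a given issue width.

    Each load/store also generates an FX address calculation, so the FX
    demand is fx + load + store = 0.8 of the issued instructions.
    """
    return {
        "fx_eus": math.ceil((MIX["fx"] + MIX["load"] + MIX["store"]) * width),
        "ls_eus": math.ceil((MIX["load"] + MIX["store"]) * width),
        "branch_eus": math.ceil(MIX["branch"] * width),
        "d_cache_ports": math.ceil((MIX["load"] + MIX["store"]) * width),
    }

# required_units(2) -> 2 FX EUs, 1 L/S EU, 1 branch EU, single cache port
# required_units(5) -> 4 FX EUs, 2 L/S EUs, 1 branch EU, dual cache ports
```

This is exactly the consistency check the slides make: a 2-3 wide core gets by with a single-ported data cache, while a 4-5 wide core needs a dual-ported one.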
3.3.3. The bottleneck evoked and its resolution (1) The issue bottleneck (a): Simplified structure of the microarchitecture assuming direct issue; (b): The issue process Figure 3.5: The principle of direct issue
3.3.3. The bottleneck evoked and its resolution (2) Eliminating the issue bottleneck Figure 3.6: Principle of the buffered (out-of-order) issue
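Buffered issue decouples decode from execution: decoded instructions wait in a buffer (shelving/reservation stations), and any instruction whose operands are ready may issue, so a stalled instruction no longer blocks younger independent ones. A toy sketch (instruction format and names are my illustration):

```python
def issue_ready(buffer, ready_regs, width):
    """Issue up to `width` instructions whose source operands are ready.

    Each buffered instruction is (name, [source registers]). Unlike direct
    issue, a stalled instruction does not block younger independent ones:
    we scan the whole buffer, not just its head.
    """
    issued = []
    for instr in list(buffer):
        name, srcs = instr
        if all(r in ready_regs for r in srcs):
            issued.append(name)
            buffer.remove(instr)
            if len(issued) == width:
                break
    return issued

buf = [("add", ["r1", "r2"]), ("mul", ["r9"]), ("sub", ["r3", "r4"])]
# With r1..r4 ready but r9 pending, "add" and "sub" issue past the
# stalled "mul" — under direct issue all three would wait.
```

Real implementations add renaming and a reorder buffer (ROB) around this core idea, which is precisely the second-generation feature list on the next slides.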
3.3.3. The bottleneck evoked and its resolution (3) First generation (narrow) superscalars Second generation (wide) superscalars Elimination of the issue bottleneck and in addition widening the processing width of all subsystems of the core
3.3.4. Attributes of second generation superscalars (1)
First generation ("narrow") vs. second generation ("wide") superscalars:
Width: 2-3 RISC or 2 CISC instructions/cycle „wide” → 4 RISC or 3 CISC instructions/cycle „wide”
Core: static branch prediction → buffered (OoO) issue, predecoding, dynamic branch prediction, register renaming, ROB
Caches: single-ported, blocking L1 data caches, off-chip L2 caches attached via the processor bus → dual-ported, non-blocking L1 data caches, directly attached off-chip L2 caches
Examples: Alpha 21064 → Alpha 21264; PA 7100 → PA 8000; Pentium → Pentium Pro, K6
3.3.4. Attributes of second generation superscalars (2) Consistency of processor features (1)
Dynamic instruction frequencies in general purpose applications: FX instructions ~40 %, load instructions ~30 %, store instructions ~10 %, branches ~20 %, FP instructions ~1-5 %
Available parallelism in general purpose applications assuming buffered issue: ~4-6 instructions/cycle (Wall 1990)
Source: Sima et al., ACA, Addison-Wesley, 1997
Figure 3.7: Extent of parallelism available in general purpose applications assuming buffered issue Source: Wall: Limits of ILP, WRL TN-15, Dec. 1990
3.3.4. Attributes of second generation superscalars (3) Consistency of processor features (2)
Reasonable core width: 4-5 instructions/cycle
Required number of data cache ports (np): np ~ 0.4 * (4-5) = 1.6-2 accesses/cycle → dual-port data caches
Required EUs (each L/S instruction generates an address calculation as well):
FX: ~0.8 * (4-5) = 3.2-4 → 3-4 FX EUs
L/S: ~0.4 * (4-5) = 1.6-2 → 2 L/S EUs
Branch: ~0.2 * (4-5) = 0.8-1 → 1 B EU
FP: ~(0.01-0.05) * (4-5) → 1 FP EU
3.3.5. Exhausting the issue parallelism In general purpose applications 2. generation („wide”) superscalars already exhaust the parallelism available at the instruction level
4. Introduction of data parallelism 4.1. Overview (1) Figure 4.1: Implementation alternatives of data parallelism
4.1. Overview (2) SIMD instructions (FX/FP): multiple operations within a single instruction Superscalar extension: SIMD instructions + superscalar issue; EPIC extension: SIMD instructions + EPIC issue Figure 4.2: Principle of introducing SIMD instructions in superscalar and VLIW (EPIC) processors
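An FX-SIMD instruction performs the same operation on several narrow operands packed into one wide register. A Python model of a packed 4×16-bit add in a 64-bit word, with MMX PADDW-style wraparound arithmetic (the model itself is my illustration):

```python
def paddw(a, b):
    """Model a packed add of four 16-bit lanes held in a 64-bit word.

    Each lane wraps around modulo 2**16, as in MMX PADDW: carries never
    cross lane boundaries, so one instruction does four independent adds.
    """
    result = 0
    for lane in range(4):
        x = (a >> (16 * lane)) & 0xFFFF
        y = (b >> (16 * lane)) & 0xFFFF
        result |= ((x + y) & 0xFFFF) << (16 * lane)
    return result
```

This is the sense in which SIMD extensions add data parallelism on top of issue parallelism: the hardware does the four lane additions in one cycle of one EU, instead of the loop shown here.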
4.2. The appearance of SIMD instructions in superscalars (1) Intel’s and AMD’s ISA extensions (MMX, SSE, SSE2, SSE3, 3DNow!, 3DNow! Professional) Figure 4.3: The emergence of FX-SIMD and FP-SIMD instructions in superscalars
2.5. and 3. generation superscalars (1)
2.5. generation superscalars: second generation superscalars + FX SIMD (MM)
3. generation superscalars: + FX SIMD + FP SIMD (MM+3D)
2.5. and 3. generation superscalars (2) Figure 4.4: The emergence of 2.5. and 3. generation superscalars
Bottlenecks evoked by third generation superscalars: system architecture (memory, display) → on-chip L2, AGP bus
4.3. Overview of superscalar processor generations
First generation ("thin superscalars"):
Width: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"
Core: unbuffered issue, no renaming, no predecoding, static branch prediction
Caches: single-ported data caches; blocking L1 data caches or non-blocking caches with up to a single pending cache miss allowed; off-chip L2 caches attached via the processor bus
ISA: no MM/3D support
Examples: Alpha 21064, PA 7100, PowerPC 601, SuperSparc, Power2, Pentium
Second generation ("wide superscalars"):
Width: 4 RISC instructions/cycle or 3 CISC instructions/cycle "wide"
Core: buffered issue (shelving), renaming, ROB, predecoding, dynamic branch prediction
Caches: dual-ported data caches; non-blocking L1 data caches with multiple cache misses allowed; off-chip direct-coupled L2 caches
ISA: no MM/3D support
Examples: Alpha 21264, PA 8000, PowerPC 604, PowerPC 620, UltraSparc I/II, Pentium Pro
2.5 generation ("wide superscalars with MM/3D support"): as second generation + FX-SIMD instructions; examples: Pentium II, K6
Third generation: on-chip L2 caches, FX- and FP-SIMD instructions; examples: Power 4, Pentium III (0.18 μ), Pentium 4, Athlon (model 4), Athlon MP (model 6)
Consequences: rising performance → rising complexity, memory bandwidth demand and required branch prediction accuracy
4.4. Exhausting the performance potential of data parallelism In general purpose applications second generation superscalars already exhaust the parallelism available at the instruction level, whereas third generation superscalars also exhaust the instruction-level parallelism available in dedicated applications (such as MM or 3D applications). Thus the era of ILP-processors came to an end.
4.5. The introduction of SIMD instructions in EPIC (VLIW) processors VLIW architectures/processors did not support SIMD instructions EPIC architectures/processors inherently support SIMD instructions (like the IA-64 ISA or processors of the Itanium family)
5. Summing up the main road of evolution 5.1. Main evolution scenarios
a. Evolutionary scenario (superscalar approach, the main road): introduction and increase of temporal parallelism → issue parallelism → data parallelism
b. Radical scenario (VLIW/EPIC approach): introduction of VLIW processing → introduction of data parallelism (EPIC)
5.2. Main road of processor evolution (1) Extent of operation-level parallelism: sequential processing (traditional von Neumann processors) → + temporal parallelism (pipeline processors, ~1985/88) → + issue parallelism (superscalar processors, ~1990/93) → + data parallelism (superscalar processors with SIMD extension, ~1994/00); the level of hardware redundancy grows accordingly Figure 5.1: The three cycles of the main road of processor evolution
5.2. The main road of evolution (2) Each cycle i (i = 1..3: temporal, issue and data parallelism) follows the same pattern: introduction of a particular dimension of parallelism → processing bottleneck(s) arise → elimination of the evoked bottleneck(s) by introducing appropriate techniques → as a consequence, the parallelism available in the given dimension becomes exhausted, and further performance increase is achievable only by introducing a new dimension of parallelism Figure 5.2: Three main cycles of the main road
5.2. Main road of the evolution (3) Figure 5.3: New techniques introduced in the three main cycles of processor evolution —
Cycle 1 (~1985/88), introduction of temporal parallelism: traditional sequential processors → pipeline processors (1. generation); advanced memory subsystem: caches (1.5. generation); advanced branch processing: branch prediction (2. generation)
Cycle 2 (~1990/93), introduction of issue parallelism: superscalar processors (1. generation); dynamic instruction scheduling, renaming, predecoding, dynamic branch prediction, ROB, dual-ported data caches, non-blocking L1 data caches with multiple cache misses allowed, off-chip direct-coupled L2 caches (2. generation)
Cycle 3 (~1994/97), introduction of data parallelism: superscalars with SIMD extension — ISA extension: FX SIMD (2.5. generation); FP SIMD and extension of the system architecture: AGP, on-chip L2 (3. generation)
5.2. Main road of evolution (4) Figure 5.4: Memory bandwidth and hardware complexity vs. rising processor performance, ~1985 to ~2000
5.2. Main road of evolution (5) Figure 5.5: Branch prediction accuracy and number of pipeline stages vs. rising clock rates (fc), ~1985 to ~2000
6. Outlook: the introduction of thread level parallelism (1) Granularity of parallelism: a single thread (instruction flow) → ILP (instruction-level parallelism); multiple threads → TLP (thread-level parallelism)
6. Outlook: the introduction of thread level parallelism (2) Where can multiple threads come from? From different applications: multiprogramming From the same application: multitasking, multithreading
6. Outlook: the introduction of thread level parallelism (3) Implementation of thread level parallelism in microprocessors:
SMP: Symmetric Multiprocessing (CMP: Chip Multiprocessing) — implementation by two or more cores placed on the same chip, each with L2/L3 caches, sharing L3/memory
SMT: Simultaneous Multithreading (HT: Hyperthreading, Intel) — implementation by a multithreaded core
6. Outlook: the introduction of thread level parallelism (4) SMT: Simultaneous Multithreading (HT: Hyperthreading, Intel) — a (four-way) superscalar issuing from a single thread vs. a multithreaded superscalar (four-way, two threads) issuing from Thread 1 and Thread 2 in the same cycle