Evolution of ILP Processing Dezső Sima Fall 2007 (Ver. 2.0) Dezső Sima, 2007
Foreword

The steady demand for higher processor performance has provoked the successive introduction of temporal, issue and intra-instruction parallelism into processor operation. Consequently, traditional sequential processors, pipelined processors, superscalar processors and superscalar processors with multimedia and 3D support mark successive evolutionary phases of microprocessors.

On the other hand, the introduction of each basic technique mentioned gave rise to specific system bottlenecks whose resolution called for innovative new techniques. Thus, the emergence of pipelined instruction processing stimulated the introduction of caches and of speculative branch processing. The debut of superscalar instruction issue gave rise to more advanced memory subsystems and more advanced branch processing. The desire to further increase the per-cycle performance of first-generation superscalars called for avoiding their issue bottleneck through the introduction of shelving, renaming and a concerted enhancement of all relevant subsystems of the microarchitecture. Finally, the utilization of intra-instruction parallelism through SIMD instructions required an adequate extension of the ISA and the system architecture.

With the main dimensions of parallelism more or less exhausted in second-generation superscalars for general purpose applications, increasing the clock frequency remained the single major possibility to increase performance further. The rapid increase of clock frequencies, however, led to limits of evolution, as discussed in Chapter II.
Structure
1. Paradigms of ILP-processing
2. Introduction of temporal parallelism
3. Introduction of issue parallelism
 3.1. VLIW processing
 3.2. Superscalar processing
4. Introduction of data parallelism
5. The main road of evolution
6. Outlook
1. Paradigms of ILP-processing
1.1. Introduction (1)

Figure 1.1: Evolution of computer classes (1950-2000): mainframes (UNIVAC, /360, /370, /390, z/900), minicomputers (PDP-8, PDP-11, VAX), supercomputers (ENIAC, NORC, CDC-6600, Cray-1, Cray-2, Cray-3, Cray-4, Cray T3E), and microcomputers from the 4004 and 8080/8088 (Altair) through the 80286, 80386, 80486, Pentium, PPro, PII, PIII and P4, branching into the server/workstation (RS/6000, Xeon), desktop PC and value PC (Celeron) classes.
1.1. Introduction (2)

Figure 1.2: The integer performance of Intel’s x86 line of processors
1.2. Paradigms of ILP-processing (1)

Paradigms of ILP-processing:
- Temporal parallelism → pipeline processors
- Issue parallelism with static dependency resolution → VLIW processors
VLIW processing: the processor is fed VLIW (Very Long Instruction Word) instructions whose sub-instructions are guaranteed to be independent (static dependency resolution, performed by the compiler).
1.2. Paradigms of ILP processing (1)

Paradigms of ILP processing:
- Temporal parallelism → pipeline processors
- Issue parallelism
  - Static dependency resolution → VLIW processors
  - Dynamic dependency resolution → superscalar processors
VLIW processing vs. superscalar processing: a VLIW (Very Long Instruction Word) processor receives independent instructions (static dependency resolution), whereas a superscalar processor accepts dependent instructions and resolves the dependencies dynamically in hardware.
1.2. Paradigms of ILP processing (1)

Paradigms of ILP processing:
- Temporal parallelism → pipeline processors
- Issue parallelism
  - Static dependency resolution → VLIW processors
  - Dynamic dependency resolution → superscalar processors
- Data parallelism → SIMD extension
1.2. Paradigms of ILP-processing (2)

Figure 1.3: The emergence of ILP-paradigms and processor types: sequential processing, then temporal parallelism (pipeline processors, ~’85), issue parallelism with dynamic dependency resolution (superscalar processors, ~’90) or static dependency resolution (VLIW processors, later EPIC processors), and data parallelism (superscalar processors with SIMD extension, ~’95-’00).
1.3. Performance potential of ILP-processors (1)

Figure: Absolute performance (ideal vs. real case) rising with each step from sequential processing through pipelining and VLIW/superscalar execution to the SIMD extension.
1.3. Performance potential of ILP-processors (2)

Performance components of ILP-processors: absolute performance is the product of the clock frequency and the per-cycle efficiency, with:
- clock frequency: depends on the technology and the microarchitecture
- per-cycle efficiency: depends on the ISA, microarchitecture, system architecture, OS, compiler and application; it is determined by the temporal, issue and data parallelism exploited and by the efficiency of speculative execution
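The two performance components above multiply, which a few lines of Python make concrete. This is an illustrative sketch, not data from the slides; all numeric values below are made-up example figures.

```python
# Illustrative sketch: absolute performance as the product of clock frequency
# and per-cycle efficiency (instructions executed per cycle).
# All numbers are hypothetical example values.

def performance(f_clock_hz, instr_per_cycle):
    """Instructions per second = clock frequency * per-cycle efficiency."""
    return f_clock_hz * instr_per_cycle

base   = performance(100e6, 1.0)   # scalar pipeline, ~1 instruction/cycle
wide   = performance(100e6, 2.5)   # issue parallelism raises the per-cycle efficiency
faster = performance(300e6, 2.5)   # plus a technology-driven clock increase

print(faster / base)   # 7.5 -> the two factors multiply
```

Either factor alone caps the achievable speedup, which is why the lecture treats clock frequency and per-cycle efficiency as separate evolution tracks.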
2. Introduction of temporal parallelism
2.1. Introduction (1)

Types of temporal parallelism in ILP processors: overlapping all phases of instruction processing (F: fetch cycle, D: decode cycle, E: execute cycle, W: write cycle), so that instruction i+1 enters the pipeline while instruction i advances to its next stage.

Figure 2.1: Implementation alternatives of temporal parallelism. Pipeline processors: Atlas (1963), IBM 360/91 (1967), i80386 (1985), R2000 (1988), M68030 (1988).
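The gain from overlapping the F-D-E-W phases can be sketched in a few lines. This assumes an ideal 4-stage pipeline with no stalls (a simplification; the branch and memory bottlenecks discussed below break this ideal).

```python
# Timing sketch of an ideal 4-stage (F, D, E, W) pipeline, no stalls assumed.

STAGES = 4  # F, D, E, W

def sequential_cycles(n):
    """Without overlap, each instruction occupies all stages in turn."""
    return n * STAGES

def pipelined_cycles(n):
    """With overlap, the pipe fills once, then one instruction completes per cycle."""
    return STAGES + (n - 1)

n = 100
print(sequential_cycles(n))  # 400
print(pipelined_cycles(n))   # 103 -> speedup approaches STAGES for large n
```

For long instruction streams the speedup approaches the stage count, which is exactly the temporal parallelism this chapter introduces.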
2.1. Introduction (2)

Figure 2.2: The appearance of pipeline (scalar) processors (1980-92): in the x86 line the 80286, 80386 and 80486; in the M68000 line the 68020, 68030 and 68040; in the MIPS R line the R2000, R3000, R6000 and R4000.
2.2. Processing bottlenecks evoked and their resolution

2.2.1. Overview
- The scarcity of memory bandwidth (2.2.2)
- The problem of branch processing (2.2.3)
2.2.2. The scarcity of memory bandwidth (1)

Moving from sequential to pipeline processing, more instructions and data need to be fetched per cycle, which calls for a larger memory bandwidth.
2.2.2. The scarcity of memory bandwidth (2)

Figure 2.3: Introduction of caches (1980-92). Early pipeline (scalar) processors (80286, 68020, R2000) still had no caches; later ones (80386, 80486, 68030, 68040, R3000, R6000, R4000) introduced them, with configurations such as C(8), C(16), C(0,1/4), C(1/4,1/4), C(4,4) and C(8,8). Notation: C(n) is a universal cache of n kB; C(n,m) are split instruction/data caches of n and m kB.
2.2.3. The problem of branch processing (1)

(E.g. in case of conditional branches.)

Figure 2.4: Processing of a conditional branch bc on a 4-stage (F-D-E-W) pipeline: the branch is recognized during decode (D), while the branch address calculation and condition checking happen during execute (E), so the branch target instruction (bti) can only be fetched afterwards; the instructions i+1, i+2 fetched in the meantime must be discarded, wasting clock cycles.
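The cost of these discarded cycles is easy to estimate. The calculation below uses the ~20% branch frequency quoted later in Section 3.3.2; the taken-ratio is an assumed example value, not a figure from the slides.

```python
# Back-of-the-envelope branch penalty on the 4-stage pipeline of Figure 2.4.
# Example values assumed: the condition is resolved in E, so roughly the two
# wrong-path fetch/decode cycles are wasted per taken branch.

branch_freq   = 0.20   # share of branches in the dynamic instruction mix
taken_ratio   = 0.6    # assumed fraction of branches actually taken
bubble_cycles = 2      # F and D of the wrong-path instructions are discarded

cpi = 1 + branch_freq * taken_ratio * bubble_cycles
print(round(cpi, 2))   # 1.24 -> roughly a quarter of the ideal throughput lost
```

Even with these modest assumptions the effective CPI degrades noticeably, which motivates the branch prediction introduced next.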
2.2.3. The problem of branch processing (2)

Figure 2.5: Principle of branch prediction in case of a conditional branch: at each conditional branch terminating a basic block the processor guesses which path will be taken and continues fetching along the guessed path until the branch is resolved and the path is approved (instructions other than conditional branches proceed normally).
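One classic way to produce the "guessed path" of Figure 2.5 is a 2-bit saturating counter per branch. The slide does not prescribe a particular predictor, so the following is one common dynamic scheme shown for illustration.

```python
# Classic 2-bit saturating-counter branch predictor (one common dynamic
# scheme; shown as an illustration of the prediction principle).

class TwoBitPredictor:
    def __init__(self):
        self.state = 0            # 0/1: predict not-taken, 2/3: predict taken

    def predict(self):
        return self.state >= 2    # True -> guess the taken path

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True] * 8 + [False] + [True] * 8   # a loop branch with one exit
correct = 0
for taken in outcomes:
    correct += (p.predict() == taken)
    p.update(taken)
print(correct, len(outcomes))   # 14 17 -> the single loop exit costs one miss
```

The hysteresis of the two-bit state means one anomalous outcome (the loop exit) does not flip the prediction, which is why such counters predict loop branches well.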
2.2.3. The problem of branch processing (3)

Figure 2.6: Introduction of branch prediction (speculative execution of branches) in (scalar) pipeline processors (1980-92, across the x86, M68000 and MIPS R lines).
2.3. Generations of pipeline processors (1)

- 1. generation pipelined: no cache, no speculative branch processing
- 1.5. generation pipelined: cache, no speculative branch processing
- 2. generation pipelined: cache and speculative branch processing
2.3. Generations of pipeline processors (2)

Figure 2.7: Generations of pipeline processors (1980-92, x86, M68000 and MIPS R lines): 1. generation (no cache, no speculative branch processing), 1.5. generation (cache, no speculative branch processing) and 2. generation (cache and speculative branch processing).
2.4. Exhausting the available temporal parallelism

2. generation pipeline processors already exhaust the available temporal parallelism.
3. Introduction of issue parallelism
3.1. Options to implement issue parallelism

On top of pipeline processing, issue parallelism can be implemented as VLIW (EPIC) instruction issue with static dependency resolution (3.2) or as superscalar instruction issue with dynamic dependency resolution (3.3).
3.2. VLIW processing (1)

Figure 3.1: Principle of VLIW processing: the VLIW processor fetches from memory/cache VLIW instructions whose sub-instructions are independent (static dependency resolution) and feeds them directly to its execution units (EUs), of which there may be on the order of 10-30.
3.2. VLIW processing (2)

VLIW: Very Long Instruction Word (term coined in 1983 by Fisher). Sub-instructions are ~32 bits long, so the instruction length is ~n*32 bits, where n is the number of execution units (EUs). A complex VLIW compiler performs static dependency resolution together with parallel optimization.
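The core of static dependency resolution is a compile-time independence check. The sketch below illustrates the idea only (a real VLIW compiler also schedules around EU and cache latencies, as noted later): sub-instructions may share one VLIW word only if none reads or writes a register another one writes.

```python
# Compile-time independence check for candidate VLIW slots (illustrative).

def independent(slots):
    """slots: list of (dest_reg, [src_regs]) sub-instructions."""
    for i, (d1, srcs1) in enumerate(slots):
        for d2, srcs2 in slots[i + 1:]:
            # Reject write-write, read-after-write and write-after-read pairs.
            if d1 == d2 or d1 in srcs2 or d2 in srcs1:
                return False
    return True

# r1 = r2+r3 and r4 = r5*r6 may share a word; r7 = r1-r8 may not,
# because it reads r1, which the first slot writes.
print(independent([("r1", ["r2", "r3"]), ("r4", ["r5", "r6"])]))  # True
print(independent([("r1", ["r2", "r3"]), ("r7", ["r1", "r8"])]))  # False
```

Slots that fail the check must go into a later VLIW word, which is one source of the partially filled instructions criticized below.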
3.2. VLIW processing (3)

Figure 3.2: Experimental and commercially available VLIW processors (with the appearance of the term ‘VLIW’ marked). Source: Sima et al., ACA, Addison-Wesley, 1997
3.2. VLIW processing (4)

Benefits of static dependency resolution:
- earlier appearance
- either a higher f_c or larger ILP
- less complex processors
3.2. VLIW processing (5)

Drawbacks of static dependency resolution:
- The compiler uses technology-dependent parameters (e.g. latencies of EUs and caches, repetition rates of EUs) for dependency resolution and parallel optimization, so new processor models require new compiler versions.
- A completely new ISA requires new compilers and OSs, the rewriting of applications, and achieving the critical mass to convince the market.
3.2. VLIW processing (6)

Drawbacks of static dependency resolution (cont.): VLIW instructions are only partially filled, so memory space and bandwidth are poorly utilized.
3.2. VLIW processing (7)

Commercial VLIW processors: Trace (1987, Multiflow) and Cydra-5 (1989, Cydrome). Within a few years both firms became bankrupt; their developers went to HP and IBM and became initiators/developers of EPIC processors.
3.2. VLIW processing (8)

VLIW → EPIC: integration of SIMD instructions and advanced superscalar features. 1994: Intel and HP announced their cooperation; 1997: the term EPIC was born; 2001: the IA-64 Itanium.
3.3. Superscalar processing

3.3.1. Introduction (1)

Main attributes of superscalar processing: pipeline processing, superscalar instruction issue, dynamic dependency resolution and a compatible ISA.
3.3.1. Introduction (2)

Figure 3.3: Experimental superscalar processors. Source: Sima et al., ACA, Addison-Wesley, 1997
3.3.1. Introduction (3) Figure 3.4: Emergence of superscalar processors Source: Sima et al., ACA, Addison-Wesley, 1997
3.3.2. Attributes of first generation superscalars (1)

Core: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"; static branch prediction.
Cache: single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus.
Examples: Pentium, PA 7100, Alpha 21064.
3.3.2. Attributes of first generation superscalars (2)

Consistency of processor features (1)

Dynamic instruction frequencies in general purpose applications (Wall 1989; Lam, Wilson 1992):
- FX instructions ~40%
- Load instructions ~30%
- Store instructions ~10%
- Branches ~20%
- FP instructions ~1-5%

Available parallelism in general purpose applications assuming direct issue: ~2 instructions/cycle. Source: Sima et al., ACA, Addison-Wesley, 1997
3.3.2. Attributes of first generation superscalars (3)

Consistency of processor features (2)

Reasonable core width: 2-3 instructions/cycle.

Required EUs (each L/S instruction generates an address calculation as well):
- FX: ~0.8 * (2-3) = 1.6-2.4 → 2-3 FX EUs
- L/S: ~0.4 * (2-3) = 0.8-1.2 → 1 L/S EU
- Branch: ~0.2 * (2-3) = 0.4-0.6 → 1 B EU
- FP: ~(0.01-0.05) * (2-3) → 1 FP EU

Required number of data cache ports: n_p ~ 0.4 * (2-3) = 0.8-1.2 → single-port data caches.
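This consistency calculation can be redone in a few lines for any core width. The instruction mix is the one quoted above; since each L/S instruction also generates an FX address calculation, the FX share rises to 0.4 + 0.4 = 0.8.

```python
# The slides' EU consistency calculation, parameterized by core width.
# Mix: 40% FX + 40% L/S address calculations -> 0.8 FX; 30% load + 10% store
# -> 0.4 L/S; 20% branches -> 0.2.

MIX = {"FX": 0.8, "L/S": 0.4, "Branch": 0.2}

def required_eus(width):
    """width: (low, high) issue width in instructions/cycle."""
    return {unit: (share * width[0], share * width[1])
            for unit, share in MIX.items()}

for width in ((2, 3), (4, 5)):   # 1st- and 2nd-generation core widths
    print(width, required_eus(width))

# width (2, 3): FX 1.6-2.4 -> 2-3 FX EUs; L/S 0.8-1.2 -> 1 L/S EU, 1 cache port
# width (4, 5): FX 3.2-4.0 -> 3-4 FX EUs; L/S 1.6-2.0 -> 2 L/S EUs, 2 cache ports
```

Running it with width (4, 5) reproduces the second-generation figures of Section 3.3.4, which shows that both generations' EU counts follow from the same instruction mix.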
3.3.3. The bottleneck evoked and its resolution (1)

Figure 3.5: The principle of direct issue: (a) the issue process; (b) the issue bottleneck, shown on a simplified structure of the microarchitecture assuming direct issue.
3.3.3. The bottleneck evoked and its resolution (2)

Eliminating the issue bottleneck. Figure 3.6: Principle of buffered (out-of-order) issue.
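The principle behind Figure 3.6 can be modeled in miniature. The following is a sketch of the idea only (register names and readiness are invented for illustration): decoded instructions wait in a buffer, and any instruction whose sources are ready may dispatch, so one stalled instruction no longer blocks the ones behind it.

```python
# Toy model of buffered (out-of-order) issue: instructions dispatch from a
# buffer as soon as their source registers are ready.

ready_regs = {"r2", "r3", "r5"}        # assumed register-file state

buffer = [                             # (instruction, source registers)
    ("i1", {"r1"}),                    # waits for r1 (e.g. a pending cache miss)
    ("i2", {"r2", "r3"}),              # sources ready
    ("i3", {"r5"}),                    # sources ready
]

dispatched = [name for name, srcs in buffer if srcs <= ready_regs]
waiting    = [name for name, srcs in buffer if not srcs <= ready_regs]
print(dispatched, waiting)   # ['i2', 'i3'] ['i1'] -> i1 no longer blocks i2, i3
```

With direct (in-order) issue, i1 would have stalled the whole issue stage; the buffer decouples issue from dependency resolution, which is what removes the bottleneck.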
3.3.3. The bottleneck evoked and its resolution (3)

First generation (narrow) superscalars → second generation (wide) superscalars: elimination of the issue bottleneck and, in addition, widening the processing width of all subsystems of the core.
3.3.4. Attributes of second generation superscalars (1)

First generation "narrow" superscalars:
- Core: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"; static branch prediction
- Caches: single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus
- Examples: Pentium, PA 7100, Alpha 21064

Second generation "wide" superscalars:
- Core: 4 RISC instructions/cycle or 3 CISC instructions/cycle "wide"; buffered (ooo) issue, predecoding, dynamic branch prediction, register renaming, ROB
- Caches: dual-ported, non-blocking L1 data caches; directly attached off-chip L2 caches
- Examples: Pentium Pro, K6, PA 8000, Alpha 21264
3.3.4. Attributes of second generation superscalars (2)

Consistency of processor features (1)

Dynamic instruction frequencies in general purpose applications (Wall 1990):
- FX instructions ~40%
- Load instructions ~30%
- Store instructions ~10%
- Branches ~20%
- FP instructions ~1-5%

Available parallelism in general purpose applications assuming buffered issue: ~4-6 instructions/cycle. Source: Sima et al., ACA, Addison-Wesley, 1997
Source: Wall: Limits of ILP, WRL TN-15, Dec. 1990 Figure 3.7: Extent of parallelism available in general purpose applications assuming buffered issue
3.3.4. Attributes of second generation superscalars (3)

Consistency of processor features (2)

Reasonable core width: 4-5 instructions/cycle.

Required EUs (each L/S instruction generates an address calculation as well):
- FX: ~0.8 * (4-5) = 3.2-4 → 3-4 FX EUs
- L/S: ~0.4 * (4-5) = 1.6-2 → 2 L/S EUs
- Branch: ~0.2 * (4-5) = 0.8-1 → 1 B EU
- FP: ~(0.01-0.05) * (4-5) → 1 FP EU

Required number of data cache ports: n_p ~ 0.4 * (4-5) = 1.6-2 → dual-port data caches.
3.3.5. Exhausting the issue parallelism

In general purpose applications, 2. generation ("wide") superscalars already exhaust the parallelism available at the instruction level.
4. Introduction of data parallelism
4.1. Overview (1) Figure 4.1: Implementation alternatives of data parallelism
4.1. Overview (2)

Figure 4.2: Principle of introducing SIMD instructions (FX/FP) in superscalar and VLIW (EPIC) processors: multiple operations within a single instruction, implemented either as a superscalar extension or as an EPIC extension of superscalar issue.
4.2. The appearance of SIMD instructions in superscalars (1)

Figure 4.3: The emergence of FX-SIMD and FP-SIMD instructions in superscalars: Intel’s and AMD’s ISA extensions (MMX, SSE, SSE2, SSE3, 3DNow!, 3DNow! Professional).
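What an FX-SIMD (MMX-style) instruction does can be emulated in ordinary Python: one operation applied to several packed fixed-point elements at once. The emulation below is illustrative only; a real SIMD unit performs the four lane additions in a single hardware instruction.

```python
# Emulation of a packed 4x16-bit add (MMX-style FX-SIMD, wrap-around mode).

MASK = 0xFFFF   # one 16-bit lane

def packed_add16(a, b):
    """Lane-wise add of two 4x16-bit values packed into 64-bit integers."""
    result = 0
    for lane in range(4):
        la = (a >> (16 * lane)) & MASK
        lb = (b >> (16 * lane)) & MASK
        result |= ((la + lb) & MASK) << (16 * lane)  # lanes stay independent
    return result

x = (4 << 48) | (3 << 32) | (2 << 16) | 1        # lanes (1, 2, 3, 4)
y = (40 << 48) | (30 << 32) | (20 << 16) | 10    # lanes (10, 20, 30, 40)
z = packed_add16(x, y)
print([(z >> (16 * i)) & MASK for i in range(4)])  # [11, 22, 33, 44]
```

Masking each lane before and after the add keeps carries from spilling between lanes, which is precisely the data parallelism the SIMD extension exposes.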
2.5. and 3. generation superscalars (1)

Second generation superscalars extended with FX SIMD (MM) instructions form the 2.5. generation superscalars; adding FX SIMD + FP SIMD (MM+3D) yields the 3. generation superscalars.
2.5. and 3. generation superscalars (2) Figure 4.4: The emergence of 2.5. and 3. generation superscalars
Bottlenecks evoked by third generation superscalars: the system architecture (memory, display), addressed by the AGP bus and the on-chip L2.
4.3. Overview of superscalar processor generations

Figure: Performance vs. complexity, memory bandwidth and branch prediction accuracy across the superscalar generations.
4.4. Exhausting the performance potential of data parallelism

In general purpose applications second generation superscalars already exhaust the parallelism available at the instruction level, whereas third generation superscalars also exhaust the parallelism available at the instruction level in dedicated applications (such as MM or 3D applications). Thus the era of ILP-processors came to an end.
4.5. The introduction of SIMD instructions in EPIC (VLIW) processors

VLIW architectures/processors did not support SIMD instructions, whereas EPIC architectures/processors inherently support them (like the IA-64 ISA or the processors of the Itanium family).
5. Summing up the main road of evolution
5.1. Main evolution scenarios

a. Evolutionary scenario (superscalar approach, the main road): introduction and increase of temporal parallelism, then of issue parallelism, then of data parallelism.

b. Radical scenario (VLIW/EPIC approach): introduction of VLIW processing, then introduction of data parallelism (EPIC).
5.2. Main road of processor evolution (1)

Figure 5.1: The three cycles of the main road of processor evolution: from traditional von Neumann processors (sequential processing) through pipeline processors (temporal parallelism, ~1985/88) and superscalar processors (+ issue parallelism, ~1990/93) to superscalar processors with SIMD extension (+ data parallelism, ~1994/00), with both the extent of operation-level parallelism and the level of hardware redundancy rising in each cycle.
5.2. The main road of evolution (2)

Figure 5.2: Three main cycles (i = 1:3) of the main road, one each for the introduction of temporal, issue and data parallelism:
1. introduction of a particular dimension of parallelism;
2. processing bottleneck(s) arise;
3. elimination of the bottleneck(s) evoked by introducing appropriate techniques;
4. as a consequence, the parallelism available in the given dimension becomes exhausted, and further performance increase is achievable only by introducing a new dimension of parallelism.
5.2. Main road of the evolution (3)

Figure 5.3: New techniques introduced in the three main cycles of processor evolution.
- Introduction of temporal parallelism (~1985/88): from traditional sequential processors to pipeline processors; 1. generation, then 1.5. generation (caches), then 2. generation (branch prediction).
- Introduction of issue parallelism (~1990/93): superscalar processors; 1. generation, then 2. generation with an advanced memory subsystem and advanced branch processing: dynamic instruction scheduling, renaming, predecoding, dynamic branch prediction, ROB, dual-ported data caches, non-blocking L1 data caches with multiple cache misses allowed, off-chip direct-coupled L2 caches.
- Introduction of data parallelism (~1994/97): superscalars with ISA (SIMD) extension; 2.5. generation with the FX SIMD extension and an extended system architecture (AGP, on-chip L2, ...), then 3. generation with the FP SIMD extension.
5.2. Main road of evolution (4)

Figure 5.4: Memory bandwidth and hardware complexity vs. rising processor performance (~1985 to ~2000).
5.2. Main road of evolution (5)

Figure 5.5: Branch prediction accuracy and the number of pipeline stages vs. rising clock rates f_c (~1985 to ~2000).
6. Outlook: introduction of thread level parallelism
6. Outlook: the introduction of thread level parallelism (1)

From ILP (instruction-level parallelism) to TP (thread-level parallelism): a thread is an instruction flow, and running multiple threads raises the granularity of parallelism.
6. Outlook: the introduction of thread level parallelism (2)

Where can multiple threads come from? From the same application (multithreading) or from different applications (multiprogramming, multitasking).
6. Outlook: the introduction of thread level parallelism (3)

Basic implementation alternatives of thread level parallelism:
- two or more cores placed on the same chip, each with its own L2/L3, sharing the L3/memory: CMP - Chip Multiprocessing (SMP: Symmetric Multiprocessing);
- a multithreaded core with its L2/L3 and L3/memory: SMT - Simultaneous Multithreading (HT: Hyperthreading, Intel).
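From the software side, both alternatives consume the same kind of workload: independent threads. The minimal sketch below (hypothetical worker and data, for illustration) shows two threads of one application; whether they run on separate cores (CMP) or share one SMT core is decided by the hardware and the OS, not by this code.

```python
# Minimal thread-level parallelism from one application: two threads,
# each summing its own slice of the data.

import threading

results = {}

def worker(name, data):
    results[name] = sum(data)   # each thread processes an independent slice

t1 = threading.Thread(target=worker, args=("thread1", range(0, 50)))
t2 = threading.Thread(target=worker, args=("thread2", range(50, 100)))
t1.start(); t2.start()
t1.join(); t2.join()
print(results["thread1"] + results["thread2"])   # 4950
```

Exposing such independent instruction flows is exactly what lets CMP and SMT designs keep growing performance after the ILP dimensions are exhausted.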
6. Outlook: the introduction of thread level parallelism (4)

Figure: A (four-way) superscalar executing a single thread vs. a multithreaded superscalar (four-way / two threads) whose issue slots are filled from Thread 1 and Thread 2. SMT: Simultaneous Multithreading (HT: Hyperthreading, Intel).