1. Evolution of ILP-processing


1. Evolution of ILP-processing. Dezső Sima, Fall 2006. © D. Sima, 2006

Structure
1. Paradigms of ILP-processing
2. Introduction of temporal parallelism
3. Introduction of issue parallelism
3.1. VLIW processing
3.2. Superscalar processing
4. Introduction of data parallelism
5. The main road of evolution
6. Outlook

1. Paradigms of ILP-processing 1.1. Introduction (1) Figure 1.1: Evolution of computer classes [timeline 1950-2000; classes shown: supercomputer, mainframe, minicomputer, microcomputer, server/workstation, desktop PC, value PC; machines named include ENIAC, UNIVAC, NORC, CDC-6600, Cray-1 to Cray-4, Cray T3E, /360, /370, /390, z/900, PDP-8, PDP-11, VAX, RS/6000, Altair, 4004, 8080, 8088, 80286, 80386, 80486, Pentium, PPro, PII, PIII, P4, Xeon, Celeron]

1.1. Introduction (2) Figure 1.2: The integer performance of Intel's x86 line of processors

1.2. Paradigms of ILP-processing (1) Temporal parallelism: pipeline processors. Issue parallelism with static dependency resolution: VLIW processors.

VLIW processing. VLIW: Very Long Instruction Word. The instructions delivered to the processor (F: fetch, E: execute) are independent of each other (static dependency resolution).

1.2. Paradigms of ILP-processing (1) Temporal parallelism: pipeline processors. Issue parallelism: VLIW processors (static dependency resolution), superscalar processors (dynamic dependency resolution).

VLIW processing vs. superscalar processing. VLIW (Very Long Instruction Word) processing: the processor receives independent instructions (static dependency resolution). Superscalar processing: the processor receives dependent instructions and resolves the dependencies itself (dynamic dependency resolution).

1.2. Paradigms of ILP-processing (1) Temporal parallelism: pipeline processors. Issue parallelism: VLIW processors (static dependency resolution), superscalar processors (dynamic dependency resolution). Data parallelism: SIMD extension.

1.2. Paradigms of ILP-processing (2) Figure 1.3: The emergence of ILP-paradigms and processor types [sequential processing; temporal parallelism: pipeline processors; issue parallelism: VLIW and EPIC processors (static dependency resolution), superscalar processors (dynamic dependency resolution); data parallelism: superscalar processors with SIMD extension; time axis: ~'85, ~'90, ~'95-'00]

1.3. Performance potential of ILP-processors (1) [Figure: absolute performance in the ideal case and in the real case for sequential, pipeline, VLIW/superscalar and SIMD-extension processing]

1.3. Performance potential of ILP-processors (2) Performance components of ILP-processors: clock frequency, temporal parallelism, issue parallelism, data parallelism, efficiency of speculative execution. Clock frequency depends on technology/μarchitecture. Per-cycle efficiency depends on ISA, μarchitecture, system architecture, OS, compiler, application.

2. Introduction of temporal parallelism 2.1. Introduction (1) Types of temporal parallelism in ILP processors: overlapping all phases (pipeline processors). Examples: Atlas (1963), IBM 360/91 (1967), i80386 (1985), R2000 (1988), M68030 (1988). (F: fetch cycle, D: decode cycle, E: execute cycle, W: write cycle) Figure 2.1: Implementation alternatives of temporal parallelism

2.1. Introduction (2) Figure 2.2: The appearance of pipeline processors [timeline 1980-92 of pipeline (scalar) processors; x86: 80286, 80386, 80486; M68000: 68020, 68030, 68040; MIPS R: R2000, R3000, R6000, R4000]

2.2. Processing bottlenecks evoked and their resolution 2.2.1. Overview The scarcity of memory bandwidth (2.2.2) The problem of branch processing (2.2.3)

2.2.2. The scarcity of memory bandwidth (1) Sequential processing → pipeline processing: more instructions and data need to be fetched per cycle, so larger memory bandwidth is needed.

2.2.2. The scarcity of memory bandwidth (2) Figure 2.3: Introduction of caches [timeline 1980-92 for the x86, M68000 and MIPS R lines, distinguishing pipeline (scalar) processors without cache(s) from those with cache(s); legend: C(n): universal cache (size in kB); C(n,m): instruction/data cache (sizes in kB)]

2.2.3. The problem of branch processing (1) (e.g. in case of conditional branches) Figure 2.4: Processing of a conditional branch on a 4-stage pipeline [rows over clock cycles: the conditional branch bc passes through F, D, E, W; the following sequential instructions are fetched behind it; the branch target instruction bti can only be fetched after condition checking (branch!) and branch address calculation]
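A rough way to quantify the delay shown in Figure 2.4 is sketched below in Python. It assumes (this is not stated on the slide) that a taken branch is resolved at the end of the E stage of the F-D-E-W pipeline, that the sequentially fetched instructions behind it are discarded, and that every counted branch is taken. The function names are illustrative only.

```python
STAGES = ["F", "D", "E", "W"]  # fetch, decode, execute, write


def branch_penalty(stages, resolve_stage):
    """Cycles lost per taken branch.

    Assumption: the target can be fetched in the cycle after resolve_stage,
    whereas a sequential successor is fetched right after the branch's F stage.
    """
    return stages.index(resolve_stage)


def total_cycles(n_instructions, n_taken_branches, stages=STAGES, resolve_stage="E"):
    """Cycle count of a scalar pipeline without branch prediction."""
    fill = len(stages) - 1  # cycles needed to fill the pipeline once
    penalty = branch_penalty(stages, resolve_stage)
    return n_instructions + fill + n_taken_branches * penalty


print(branch_penalty(STAGES, "E"))
print(total_cycles(100, 20))
```

Under these assumptions each taken branch costs two cycles, so 100 instructions with 20 taken branches need 143 cycles instead of 103. Resolving the branch earlier, or guessing its outcome, is what shrinks this gap.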

2.2.3. The problem of branch processing (2) Figure 2.5: Principle of branch prediction in case of a conditional branch [legend: conditional branches; instructions other than conditional branches; basic block; guessed path; approved path]

2.2.3. The problem of branch processing (3) Figure 2.6: Introduction of branch prediction in (scalar) pipeline processors [timeline 1980-92 for the x86, M68000 and MIPS R lines with their cache configurations, marking the processors that execute branches speculatively]

2.3. Generations of pipeline processors (1)
Generation | Cache | Speculative branch processing
1. generation pipelined | no | no
1.5. generation pipelined | yes | no
2. generation pipelined | yes | yes

2.3. Generations of pipeline processors (2) Figure 2.7: Generations of pipeline processors [timeline 1980-92; x86: 80286, 80386, 80486; M68000: 68020, 68030, 68040; MIPS R: R2000, R3000, R6000, R4000; legend: 1. generation pipelined (no cache, no speculative branch processing); 1.5. generation pipelined (cache, no speculative branch processing); 2. generation pipelined (cache, speculative branch processing)]

2.4. Exhausting the available temporal parallelism: 2. generation pipeline processors already exhaust the available temporal parallelism.

3. Introduction of issue parallelism 3.1. Options to implement issue parallelism. Starting from pipeline processing: VLIW (EPIC) instruction issue with static dependency resolution (3.2), or superscalar instruction issue with dynamic dependency resolution (3.3).

3.2. VLIW processing (1) Figure 3.1: Principle of VLIW processing [memory/cache delivers VLIW instructions with independent sub-instructions (static dependency resolution) to a VLIW processor with ~10-30 execution units (EUs)]

3.2. VLIW processing (2) VLIW: Very Long Instruction Word. Term: 1983 (Fisher). Length of sub-instructions: ~32 bit. Instruction length: ~n*32 bit, where n is the number of execution units (EUs). Static dependency resolution with parallel optimization, hence a complex VLIW compiler.

3.2. VLIW processing (3) The term ‘VLIW’ Figure 3.2: Experimental and commercially available VLIW processors Source: Sima et al., ACA, Addison-Wesley, 1997

3.2. VLIW processing (4) Benefits of static dependency resolution: less complex processors; earlier appearance; either higher fc or larger ILP.

3.2. VLIW processing (5) Drawbacks of static dependency resolution: a completely new ISA (new compilers and OS, rewriting of applications, achieving the critical mass to convince the market). The compiler uses technology-dependent parameters (e.g. latencies of EUs and caches, repetition rates of EUs) for dependency resolution and parallel optimization, so new processor models require new compiler versions.

3.2. VLIW processing (6) Drawbacks of static dependency resolution (cont.): VLIW instructions are only partially filled, so memory space and bandwidth are poorly utilized.
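The partial filling of VLIW instructions can be illustrated with a small Python sketch of static dependency resolution. It is a toy greedy packer, not the algorithm of any real VLIW compiler: it assumes unit latencies, identical execution units, and dependencies that point only to earlier operations. The operation names and the `pack_vliw` helper are made up for illustration; the 32-bit sub-instruction length is the approximate figure from the slides.

```python
NOP = None  # empty slot in a long instruction word


def pack_vliw(ops, deps, width):
    """Place each operation (in program order) into the earliest word that
    lies after all of its producers and still has a free slot."""
    word_of = {}
    words = []
    for op in ops:
        earliest = max((word_of[d] + 1 for d in deps.get(op, ())), default=0)
        w = earliest
        while w < len(words) and len(words[w]) >= width:
            w += 1
        while len(words) <= w:
            words.append([])
        words[w].append(op)
        word_of[op] = w
    # Pad every word with NOPs up to the full width.
    return [word + [NOP] * (width - len(word)) for word in words]


ops = ["a", "b", "c", "d", "e", "f"]
deps = {"c": ["a", "b"], "d": ["c"], "f": ["d"]}  # hypothetical dependencies
words = pack_vliw(ops, deps, width=4)

used = sum(slot is not NOP for word in words for slot in word)
utilization = used / (len(words) * 4)
bits_per_word = 4 * 32  # n * 32 bit with n = 4 execution units

for word in words:
    print(word)
print(utilization, bits_per_word)
```

In this example the dependency chain a/b → c → d → f forces four long words, of which only the first is mostly filled; the remaining slots are NOPs that still occupy memory space and fetch bandwidth.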

3.2. VLIW processing (7) Commercial VLIW processors: Trace (Multiflow, 1987), Cydra-5 (Cydrome, 1989). Within a few years both firms went bankrupt. Their developers moved to HP and IBM, where they became initiators/developers of EPIC processors.

3.2. VLIW processing (8) VLIW → EPIC: integration of SIMD instructions and advanced superscalar features. 1994: Intel and HP announced their cooperation. 1997: the term EPIC was born. 2001: IA-64 → Itanium.

3.3. Superscalar processing 3.3.1. Introduction (1) Pipeline processing → superscalar instruction issue. Main attributes of superscalar processing: dynamic dependency resolution; compatible ISA.

3.3.1. Introduction (2) Figure 3.3: Experimental superscalar processors Source: Sima et al., ACA, Addison-Wesley, 1997

3.3.1. Introduction (3) Figure 3.4: Emergence of superscalar processors Source: Sima et al., ACA, Addison-Wesley, 1997

3.3.2. Attributes of first generation superscalars (1)
Width: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide"
Core: static branch prediction
Cache: single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus
Examples: Alpha 21064, PA 7100, Pentium

3.3.2. Attributes of first generation superscalars (2) Consistency of processor features (1)
Dynamic instruction frequencies in general purpose applications: FX instructions ~40 %, load instructions ~30 %, store instructions ~10 %, branches ~20 %, FP instructions ~1-5 %
Available parallelism in general purpose applications assuming direct issue: ~2 instructions/cycle (Wall 1989; Lam, Wilson 1992)
Source: Sima et al., ACA, Addison-Wesley, 1997

3.3.2. Attributes of first generation superscalars (3) Consistency of processor features (2)
Reasonable core width: 2-3 instructions/cycle
Required number of data cache ports (np): np ~ 0.4 * (2-3) = 0.8-1.2 instructions/cycle → single port data caches
Required EUs (each L/S instruction generates an address calculation as well):
FX ~ 0.8 * (2-3) = 1.6-2.4 → 2-3 FX EUs
L/S ~ 0.4 * (2-3) = 0.8-1.2 → 1 L/S EU
Branch ~ 0.2 * (2-3) = 0.4-0.6 → 1 B EU
FP ~ (0.01-0.05) * (2-3) → 1 FP EU
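The sizing arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the slide's back-of-the-envelope estimate; the dictionary keys and the `resource_demand` name are illustrative, and the instruction frequencies are the approximate values quoted on the previous slide (in percent, with FP taken at its upper bound of 5 %).

```python
# Approximate dynamic instruction frequencies in percent (from the slides).
FREQ = {"fx": 40, "load": 30, "store": 10, "branch": 20, "fp": 5}


def resource_demand(width):
    """Average demand per cycle for a core issuing `width` instructions/cycle."""
    load_store = FREQ["load"] + FREQ["store"]  # 40 % of all instructions
    return {
        "cache_ports": load_store * width / 100,
        # Each load/store also needs an FX address calculation.
        "fx_eus": (FREQ["fx"] + load_store) * width / 100,
        "ls_eus": load_store * width / 100,
        "branch_eus": FREQ["branch"] * width / 100,
        "fp_eus": FREQ["fp"] * width / 100,
    }


for width in (2, 3):
    print(width, resource_demand(width))
```

Calling `resource_demand(4)` and `resource_demand(5)` gives the corresponding estimates for wider cores, for example a data cache port demand of 1.6 to 2 per cycle.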

3.3.3. The bottleneck evoked and its resolution (1) The issue bottleneck. Figure 3.5: The principle of direct issue [(a) simplified structure of the microarchitecture assuming direct issue; (b) the issue process]

3.3.3. The bottleneck evoked and its resolution (2) Eliminating the issue bottleneck Figure 3.6: Principle of the buffered (out of order) issue
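The difference between direct and buffered issue can be made concrete with a small Python simulation. It is a deliberately simplified model and not the mechanism of any particular processor: all instructions have unit latency, a result becomes usable in the cycle after its producer was issued, and the shelving buffer is assumed to be large enough to hold the whole example. The function names and the dependency list are hypothetical.

```python
def _ready(i, deps, issued_at, cycle):
    """True if all producers of instruction i were issued in an earlier cycle."""
    return all(issued_at[d] is not None and issued_at[d] < cycle for d in deps[i])


def direct_issue(deps, width):
    """In-order issue: the first non-ready instruction blocks all later ones."""
    n = len(deps)
    issued_at = [None] * n
    nxt, cycle = 0, 0
    while nxt < n:
        slots = 0
        while nxt < n and slots < width and _ready(nxt, deps, issued_at, cycle):
            issued_at[nxt] = cycle
            nxt += 1
            slots += 1
        cycle += 1
    return cycle


def buffered_issue(deps, width):
    """Out-of-order issue: any ready instruction may be issued, oldest first."""
    n = len(deps)
    issued_at = [None] * n
    cycle = 0
    while any(t is None for t in issued_at):
        ready = [i for i in range(n)
                 if issued_at[i] is None and _ready(i, deps, issued_at, cycle)]
        for i in ready[:width]:
            issued_at[i] = cycle
        cycle += 1
    return cycle


# deps[i] lists the earlier instructions that instruction i depends on.
deps = [[], [0], [1], [], [], []]
print(direct_issue(deps, width=2), buffered_issue(deps, width=2))
```

For this six-instruction sequence the in-order model needs four issue cycles, because the dependent instructions 1 and 2 hold back the independent ones behind them, while the buffered model fills the free slots with instructions 3 to 5 and finishes in three.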

3.3.3. The bottleneck evoked and its resolution (3) First generation (narrow) superscalars → second generation (wide) superscalars: elimination of the issue bottleneck and, in addition, widening the processing width of all subsystems of the core.

3.3.4. Attributes of second generation superscalars (1)
Feature | First generation "narrow" superscalars | Second generation "wide" superscalars
Width | 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide" | 4 RISC instructions/cycle or 3 CISC instructions/cycle "wide"
Core | Static branch prediction | Buffered (ooo) issue, predecoding, dynamic branch prediction, register renaming, ROB
Caches | Single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus | Dual-ported, non-blocking L1 data caches; directly attached off-chip L2 caches
Examples | Alpha 21064, PA 7100, Pentium | Alpha 21264, PA 8000, Pentium Pro, K6

3.3.4. Attributes of second generation superscalars (2) Consistency of processor features (1)
Dynamic instruction frequencies in general purpose applications: FX instructions ~40 %, load instructions ~30 %, store instructions ~10 %, branches ~20 %, FP instructions ~1-5 %
Available parallelism in general purpose applications assuming buffered issue: ~4-6 instructions/cycle (Wall 1990)
Source: Sima et al., ACA, Addison-Wesley, 1997

Figure 3.7: Extent of parallelism available in general purpose applications assuming buffered issue Source: Wall: Limits of ILP, WRL TN-15, Dec. 1990

3.3.4. Attributes of second generation superscalars (3) Consistency of processor features (2)
Reasonable core width: 4-5 instructions/cycle
Required number of data cache ports (np): np ~ 0.4 * (4-5) = 1.6-2 instructions/cycle → dual port data caches
Required EUs (each L/S instruction generates an address calculation as well):
FX ~ 0.8 * (4-5) = 3.2-4 → 3-4 FX EUs
L/S ~ 0.4 * (4-5) = 1.6-2 → 2 L/S EUs
Branch ~ 0.2 * (4-5) = 0.8-1 → 1 B EU
FP ~ (0.01-0.05) * (4-5) → 1 FP EU

3.3.5. Exhausting the issue parallelism: in general purpose applications, 2. generation ("wide") superscalars already exhaust the parallelism available at the instruction level.

4. Introduction of data parallelism 4.1. Overview (1) Figure 4.1: Implementation alternatives of data parallelism

4.1. Overview (2) Figure 4.2: Principle of introducing SIMD instructions in superscalar and VLIW (EPIC) processors [SIMD instructions (FX/FP) perform multiple operations within a single instruction; they are added to superscalar issue as a superscalar extension and to EPIC as an EPIC extension]
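"Multiple operations within a single instruction" can be imitated in plain Python by treating one 32-bit integer as four packed 8-bit lanes. The sketch below performs a wrapping byte-wise addition on all four lanes at once, in the spirit of the packed integer (FX-SIMD) additions that such ISA extensions provide. The 32-bit word size, the lane width and both function names are assumptions chosen for brevity.

```python
LOW7 = 0x7F7F7F7F  # lower seven bits of every 8-bit lane
HIGH = 0x80808080  # top bit of every 8-bit lane


def packed_add_u8(a, b):
    """Add four unsigned 8-bit lanes at once; each lane wraps modulo 256."""
    # Adding only the low 7 bits can never carry into the neighbouring lane.
    partial = (a & LOW7) + (b & LOW7)
    # The top bit of each lane is then patched in with a carry-free XOR.
    return (partial ^ ((a ^ b) & HIGH)) & 0xFFFFFFFF


def scalar_add_u8(a, b):
    """Reference version: one addition per lane, as a scalar core would do it."""
    result = 0
    for shift in (0, 8, 16, 24):
        lane = (((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)) & 0xFF
        result |= lane << shift
    return result


x, y = 0x01FF7F10, 0x0101017F
print(hex(packed_add_u8(x, y)), hex(scalar_add_u8(x, y)))
```

The packed version needs a fixed handful of word-wide operations regardless of the number of lanes, whereas the scalar reference loops once per lane; a hardware SIMD instruction collapses the packed variant into a single operation.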

4.2. The appearance of SIMD instructions in superscalars (1) Intel's and AMD's ISA extensions (MMX, SSE, SSE2, SSE3, 3DNow!, 3DNow! Professional) Figure 4.3: The emergence of FX-SIMD and FP-SIMD instructions in superscalars

2.5. and 3. generation superscalars (1) Second generation superscalars extended with FX SIMD (MM): 2.5. generation superscalars. Extended with FX SIMD + FP SIMD (MM+3D): 3. generation superscalars.

2.5. and 3. generation superscalars (2) Figure 4.4: The emergence of 2.5. and 3. generation superscalars

Bottlenecks evoked by third generation superscalars: system architecture (memory, display) → on-chip L2, AGP bus.

4.3. Overview of superscalar processor generations
Features | First generation ("thin superscalars") | Second generation ("wide superscalars")
Width | 2-3 RISC instructions/cycle or 2 CISC instructions/cycle "wide" | 4 RISC instructions/cycle or 3 CISC instructions/cycle "wide"
Core | Unbuffered issue; no renaming; static branch prediction; no ROB | Buffered issue (shelving); renaming; predecoding; dynamic branch prediction; ROB
Caches | Single ported data caches; blocking L1 data caches or nonblocking caches with up to a single pending cache miss allowed; off-chip L2 caches attached via the processor bus | Dual ported data caches; nonblocking L1 data caches with multiple cache misses allowed; off-chip direct coupled L2 caches
ISA | No MM/3D support | No MM/3D support
Examples | Alpha 21064, PA 7100, PowerPC 601, SuperSparc, Pentium | Alpha 21264, PA 8000, PowerPC 604, UltraSparc I, II, Pentium Pro, PowerPC 620
2.5 generation ("wide superscalars with MM/3D support"): FX-SIMD instructions. Examples: Pentium II, K6.
Third generation: on-chip L2 caches; FX- and FP-SIMD instructions. Examples: Power 4, Pentium III (0.18 µm), Pentium 4, Athlon (model 4), Athlon MP (model 6).
Footnotes attached to individual examples (Power2 among them): no predecoding; no renaming; dual ported data cache, optional; no off-chip direct coupled L2; only single ported data cache; dynamic branch prediction.
Trends across the generations: performance, complexity, memory bandwidth, branch prediction accuracy.

4.4. Exhausting the performance potential of data parallelism. In general purpose applications, second generation superscalars already exhaust the parallelism available at the instruction level, whereas third generation superscalars also exhaust the instruction-level parallelism available in dedicated applications (such as MM or 3D applications). Thus the era of ILP-processors came to an end.

4.5. The introduction of SIMD instructions in EPIC (VLIW) processors. VLIW architectures/processors did not support SIMD instructions. EPIC architectures/processors inherently support SIMD instructions (like the IA-64 ISA or processors of the Itanium family).

5. Summing up the main road of evolution 5.1. Main evolution scenarios
a. Evolutionary scenario (superscalar approach) (the main road): introduction and increase of temporal parallelism → introduction and increase of issue parallelism → introduction and increase of data parallelism
b. Radical scenario (VLIW/EPIC approach): introduction of VLIW processing → introduction of data parallelism (EPIC)

5.2. Main road of processor evolution (1) Figure 5.1: The three cycles of the main road of processor evolution [extent of operation level parallelism and level of hardware redundancy over time t; sequential processing: traditional von Neumann processors; ILP processing: temporal parallelism (pipeline processors, ~1985/88), + issue parallelism (superscalar processors, ~1990/93), + data parallelism (superscalar processors with SIMD extension, ~1994/00)]

5.2. The main road of evolution (2) Figure 5.2: Three main cycles of the main road
For i = 1..3 (introduction of temporal, issue and data parallelism):
- introduction of a particular dimension of parallelism
- processing bottleneck(s) arise
- elimination of the bottleneck(s) evoked by introducing appropriate techniques
- as a consequence, parallelism available at the given dimension becomes exhausted; further performance increase is achievable only by introducing a new dimension of parallelism

5.2. Main road of the evolution (3) Figure 5.3: New techniques introduced in the three main cycles of processor evolution
Traditional sequential processing: traditional sequential processors
Introduction of temporal parallelism (pipeline processors, ~1985/88): 1. generation → 1.5. generation: caches (advanced memory subsystem) → 2. generation: branch prediction (advanced branch processing)
Introduction of issue parallelism (superscalar processors, ~1990/93): 1. generation → 2. generation: dynamic instruction scheduling, renaming, predecoding, dynamic branch prediction, ROB, dual ported data caches, nonblocking L1 data caches with multiple cache misses allowed, off-chip direct coupled L2 caches
Introduction of data parallelism (superscalars with SIMD extension, ~1994/97): 2.5. generation: FX SIMD extension (ISA extension) → 3. generation: FP SIMD extension; extension of system architecture: AGP, on-chip L2, ...

5.2. Main road of evolution (4) Figure 5.4: Memory bandwidth and hardware complexity vs rising processor performance [curves for memory bandwidth, hardware complexity and performance over time t, ~1985 to ~2000]

5.2. Main road of evolution (5) Figure 5.5: Branch prediction accuracy vs rising clock rates [curves for accuracy of branch prediction, number of pipeline stages and fc over time t, ~1985 to ~2000]

6. Outlook: the introduction of thread level parallelism (1) Granularity of parallelism: within a thread (instruction flow), ILP (instruction-level parallelism); across multiple threads, TP (thread-level parallelism).

6. Outlook: the introduction of thread level parallelism (2) Where can multiple threads come from? From different applications (multiprogramming) or from the same application (multitasking, multithreading).

6. Outlook: the introduction of thread level parallelism (3) Implementation of thread level parallelism in microprocessors. SMP: Symmetric Multiprocessing (CMP: Chip Multiprocessing): implementation by two or more cores placed on the same chip. SMT: Simultaneous Multithreading (HT: Hyperthreading (Intel)): implementation by a multithreaded core. [Figure: a chip with several cores, each backed by L2/L3 and L3/memory, next to a chip with a single SMT core backed by L2/L3 and L3/memory]

6. Outlook: the introduction of thread level parallelism (4) SMT: Simultaneous Multithreading (HT: Hyperthreading (Intel)) [Figure: issue slots of a (four-way) superscalar running a single thread compared with a multithreaded superscalar (four-way/two threads) running thread 1 and thread 2]