A Future for Parallel Computer Architectures. Mark D. Hill, Computer Sciences Department, University of Wisconsin-Madison. © 2004 Mark D. Hill, Wisconsin Multifacet Project.

Presentation transcript:

Slide 1: A Future for Parallel Computer Architectures
Mark D. Hill, Computer Sciences Department, University of Wisconsin-Madison
Multifacet Project, August 2004
Full Disclosure: Consult for Sun & US NSF

Slide 2: Summary
Issues
  – Moore’s Law, etc.
  – Instruction Level Parallelism for More Performance
  – But Memory Latency Longer (e.g., 200 FP multiplies)
Must Exploit Memory Level Parallelism
  – At Thread: Runahead & Continual Flow Pipeline
  – At Processor: Simultaneous Multithreading
  – At Chip: Chip Multiprocessing

Slide 3: Outline
Computer Architecture Drivers
  – Moore’s Law, Microprocessors, & Caching
Instruction Level Parallelism (ILP) Review
Memory Level Parallelism (MLP)
Improving MLP of Thread
Improving MLP of a Core or Chip
CMP Systems

Slide 4: (Technologists) Moore’s Law

Slide 5: What If Your Salary?
Parameters
  – $16 base
  – 59% growth/year
  – 40 years
Initially $16 → buy book
3rd year’s $64 → buy computer game
16th year’s $27,000 → buy car
22nd year’s $430,000 → buy house
40th year’s > billion dollars → buy a lot
You have to find fundamental new ways to spend money!
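The salary figures on this slide follow from plain compound growth. A minimal sketch that reproduces them, assuming the salary in year n is base * (1 + growth)^n; the function name `salary` is made up for this illustration:

```python
def salary(base: float, growth: float, year: int) -> float:
    """Salary after `year` years of compound growth (assumed model)."""
    return base * (1.0 + growth) ** year


# Parameters from the slide: $16 base, 59% growth per year.
for year in (0, 3, 16, 22, 40):
    print(f"year {year:2d}: ${salary(16, 0.59, year):,.0f}")
```

The printed values land close to the slide's rounded numbers, and the 40th year exceeds one billion dollars.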

Slide 6: Microprocessor
First Microprocessor in 1971
  – Processor on one chip
  – Intel 4004
  – 2300 transistors
  – Barely a processor
  – Could access 300 bytes of memory (0.0003 megabytes)
Use more and faster transistors in parallel

Slide 7: Other “Moore’s Laws”
Other technologies improving rapidly
  – Magnetic disk capacity
  – DRAM capacity
  – Fiber-optic network bandwidth
Other aspects improving slowly
  – Delay to memory
  – Delay to disk
  – Delay across networks
Computer Implementor’s Challenge
  – Design with dissimilarly expanding resources
  – To Double computer performance every two years
  – A.k.a., (Popular) Moore’s Law

Slide 8: Caching & Memory Hierarchies, cont.
VAX-11/780
  – 1 Instruction = Memory
Now
  – 100s Instructions = Memory
Caching Applied Recursively
  – Registers
  – Level-one cache
  – Level-two cache
  – Memory
  – Disk
  – (File Server)
  – (Proxy Cache)

Slide 9: Outline
Computer Architecture Drivers
Instruction Level Parallelism (ILP) Review
  – Pipelining & Out-of-Order
  – Intel P3, P4, & Banias
Memory Level Parallelism (MLP)
Improving MLP of Thread
Improving MLP of a Core or Chip
CMP Systems

Slide 10: Instruction Level Parallelism (ILP) 101
Non-Pipelined (Faster via Bit Level Parallelism (BLP))
Pipelined (ILP + BLP; 1st microprocessors RISC)
[Figures: two instruction timing diagrams; axes Time and Instrns]

Slide 11: Instruction Level Parallelism 102
SuperScalar (& Pipelined)
Add Cache Misses in red
What if data independent?
[Figures: two instruction timing diagrams; axes Time and Instrns]

Slide 12: Instruction Level Parallelism 103
Out-of-Order (& SuperScalar & Pipelined)
In-order fetch, decode, rename, & issuing of instructions with good branch prediction
Out-of-order speculative execution of instructions in “window”, honoring data dependencies
In-order retirement, preserving sequential instruction semantics
[Figure: instruction timing diagram; axes Time and Instrns]

Slide 13: Out-of-Order Example: Intel x86 P6 Core
“CISC” Twist to Out-of-Order
  – In-order front end cracks x86 instructions into micro-ops (like RISC instructions)
  – Out-of-order execution
  – In-order retirement of micro-ops in x86 instruction groups
Used in Pentium Pro, II, & III
  – 3-way superscalar of micro-ops
  – 10-stage pipeline (for branch misprediction penalty)
  – Sophisticated branch prediction
  – Deep pipeline allowed scaling for many generations

Slide 14: Pentium 4 Core [Hinton 2001]
Follow basic approach of P6 core
Trace Cache stores dynamic micro-op sequences
20-stage pipeline (for branch misprediction penalty)
128 active micro-ops (48 loads & 24 stores)
Deep pipeline to allow scaling for many generations

Slide 15: Intel Kills Pentium 4 Roadmap
Why? I can speculate
Too Much Power?
  – More transistors
  – Higher-frequency transistors
  – Designed before power became first-order design constraint
Too Little Performance?
  – Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle
  – For x86 (Instructions/Program fixed): Performance = Instructions/Cycle * Frequency
  – Pent4 Instruction/Cycle loss vs. Frequency gains?
Intel moving away from marketing with frequency!
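The performance equation on this slide can be tried out numerically. A minimal sketch, assuming two hypothetical designs whose IPC and frequency values are invented for illustration and are not measurements of any Intel part:

```python
def time_per_program(instructions: float, cycles_per_instruction: float,
                     seconds_per_cycle: float) -> float:
    """Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle."""
    return instructions * cycles_per_instruction * seconds_per_cycle


def instructions_per_second(ipc: float, frequency_hz: float) -> float:
    """With a fixed instruction count, performance is Instructions/Cycle * Frequency."""
    return ipc * frequency_hz


# Hypothetical numbers: a shorter pipeline with higher IPC versus a
# deeper pipeline that clocks faster but completes fewer instructions per cycle.
shallow = instructions_per_second(ipc=1.0, frequency_hz=2.0e9)
deep = instructions_per_second(ipc=0.6, frequency_hz=3.0e9)
print(f"shallow: {shallow:.2e} instr/s, deep: {deep:.2e} instr/s")
```

Under these made-up numbers the 50% frequency gain does not make up for the IPC loss, which is the trade-off the slide asks about.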

Slide 16: Pentium M / Banias [Gochman 2003]
For laptops, but now more general
  – Key: Feature must add 1% performance for 3% power
  – Why: Increasing voltage for 1% perf. costs 3% power
Techniques
  – Enhance Intel SpeedStep™
  – Shorter pipeline (more like P6)
  – Better branch predictor (e.g., loops)
  – Special handling of memory stack
  – Fused micro-ops
  – Lower power transistors (off critical path)
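One common way to arrive at a "1% performance for 3% power" rule is a cubic scaling argument. The sketch below is an assumption and is not taken from the slide: it supposes dynamic power proportional to V^2 * f, frequency proportional to voltage, and performance proportional to frequency.

```python
def relative_dynamic_power(scale: float) -> float:
    """Power relative to baseline when voltage and frequency both scale by `scale`.

    Assumes dynamic power ~ V^2 * f and f ~ V (a textbook simplification).
    """
    voltage = scale
    frequency = scale
    return voltage ** 2 * frequency


gain = 1.01  # 1% more voltage, hence (by assumption) 1% more performance
power = relative_dynamic_power(gain)
print(f"performance +{(gain - 1) * 100:.1f}%, power +{(power - 1) * 100:.1f}%")
```

Under those assumptions a 1% gain bought with voltage costs about 3% power, so a microarchitectural feature is only worth adding if it beats that exchange rate.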

Slide 17: What about Future for Intel & Others?
Worry about power & energy (not this talk)
Memory latency too great for out-of-order cores to tolerate (coming next)
Memory Level Parallelism for Thread, Processor, & Chip!

Slide 18: Outline
Computer Architecture Drivers
Instruction Level Parallelism (ILP) Review
Memory Level Parallelism (MLP)
  – Cause & Effect
Improving MLP of Thread
Improving MLP of a Core or Chip
CMP Systems

Slide 19: Out-of-Order w/ Slower Off-Chip Misses
Out-of-Order (& Super-Scalar & Pipelined)
But Off-Chip Misses are now hundreds of cycles
Good Case!
[Figures: two instruction timing diagrams; axes Time and Instrns]

Slide 20: Out-of-Order w/ Slower Off-Chip Misses
More Realistic Case
Why does yellow instruction block?
  – Assumes 4-instruction window (maximum outstanding)
  – Yellow instruction awaits “instruction - 4” (1st cache miss)
  – Actual windows are instructions, but L2 miss slower
Key Insight: Memory-Level Parallelism (MLP) [Chou, Fahs, & Abraham, ISCA 2004]
[Figure: timing diagram of instructions I1 to I4 in a 4-instrn window; axes Time and Instrns]

Slide 21: Out-of-Order & Memory Level Parallelism (MLP)
Compute & Memory Phases
Good Case: MLP = 2
Bad Case: MLP = 1

Slide 22: MLP Model
MLP = # Off-Chip Accesses / # Memory Phases
Execution has Compute & Memory Phases
  – Compute Phase largely overlaps Memory Phase
  – In Limit as Memory Latency increases, Compute Phase hidden by Memory Phase
  – Execution Time = # Memory Phases * Memory Latency
Execution Time = (# Off-Chip Accesses / MLP) * Memory Latency
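The model can be written out directly. Because MLP is defined as off-chip accesses per memory phase, the number of memory phases equals accesses divided by MLP, and execution time in the memory-bound limit is that count times the memory latency. A minimal sketch; the function names and the sample numbers are illustrative assumptions:

```python
def mlp(off_chip_accesses: int, memory_phases: int) -> float:
    """MLP = # Off-Chip Accesses / # Memory Phases."""
    return off_chip_accesses / memory_phases


def execution_time(off_chip_accesses: int, mlp_value: float,
                   memory_latency: float) -> float:
    """Memory-bound limit: (# accesses / MLP) memory phases, each one latency long."""
    memory_phases = off_chip_accesses / mlp_value
    return memory_phases * memory_latency


# Hypothetical workload: 1000 off-chip accesses, 200-cycle memory latency.
for m in (1.0, 2.0, 4.0):
    print(f"MLP = {m}: {execution_time(1000, m, 200):,.0f} cycles")
```

Doubling MLP halves the modeled execution time, the same effect as halving either the access count or the latency.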

Slide 23: MLP Action Items
Execution Time = (# Off-Chip Accesses / MLP) * Memory Latency
Reduce # Off-Chip Accesses
  – E.g., better caches or compression (Multifacet)
Reduce Memory Latency
  – E.g., on-chip memory controller (AMD)
Increase MLP (next slides)
Processor changes that don’t affect MLP don’t help!

Slide 24: What Limits MLP in Processor? [Chou et al.]
Issue window and reorder buffer size
Instruction fetch off-chip accesses
Unresolvable mispredicted branches
Load and branch issue restrictions
Serializing instructions

Slide 25: What Limits MLP in Program?
Depending on data from off-chip memory accesses
For addresses
  – Bad: Pointer chasing with poor locality
  – Good: Array where address calculation separate from data
For unpredictable branch decisions
  – Bad: Branching on data values with poor locality
  – Good: Iterative loops with highly predictable branching
But, as programmer, which accesses go off-chip?
Also: very poor instruction locality & frequent system calls, context switches, etc.
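The pointer-chasing versus array contrast can be illustrated with a toy phase counter. This is a sketch under simplifying assumptions, not a model from the talk: every access misses off-chip, at most `window` consecutive accesses are visible at once, and an access can issue only after the earlier miss that produces its address has completed. The name `memory_phases` and the `deps` encoding are made up for this example.

```python
def memory_phases(deps, window):
    """Count memory phases for a sequence of off-chip accesses.

    deps[i] is the index of the earlier access whose data access i needs
    for its address, or None if the address is independent of prior misses.
    """
    n = len(deps)
    completed = [False] * n
    head = 0          # oldest access that has not completed
    phases = 0
    while head < n:
        in_window = range(head, min(head + window, n))
        # Accesses whose address is known at the start of this phase overlap.
        ready = [i for i in in_window
                 if not completed[i]
                 and (deps[i] is None or completed[deps[i]])]
        for i in ready:
            completed[i] = True
        phases += 1
        while head < n and completed[head]:
            head += 1
    return phases


pointer_chase = [None] + list(range(7))   # each load needs the previous load's data
array_walk = [None] * 8                   # addresses computed independently of data

for name, deps in (("pointer chase", pointer_chase), ("array walk", array_walk)):
    p = memory_phases(deps, window=4)
    print(f"{name}: {p} phases, MLP = {len(deps) / p}")
```

In this toy model the linked traversal serializes every miss, while the array walk is limited only by the window size.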

Slide 26: Outline
Computer Architecture Drivers
Instruction Level Parallelism (ILP) Review
Memory Level Parallelism (MLP)
Improving MLP of Thread
  – Runahead, Continual Flow Pipeline
Improving MLP of a Core or Chip
CMP Systems

Slide 27: Runahead Example
Base Out-of-Order, MLP = 1
With Runahead, MLP = 2
  1. Normal mode
  2. Checkpoint
  3. Runahead mode
  4. Restore checkpoint
  5. Normal mode (but faster)
[Figure: timing diagram of instructions I1 to I4 in a 4-instrn window]

Slide 28: Runahead Execution [Dundas ICS97, Mutlu HPCA03]
  1. Execute normally until instruction M’s off-chip access blocks issue of more instructions
  2. Checkpoint processor
  3. Discard instruction M, set M’s destination register to poisoned, & speculatively Runahead
     – Instructions propagate poisoned from source to destination
     – Seek off-chip accesses to start prefetches & increase MLP
  4. Restore checkpoint when off-chip access M returns
  5. Resume normal execution (hopefully faster)
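The MLP gain from runahead can be sketched with a small counting model. The assumptions are stated loudly because none of this comes from the slides: all misses are independent (none is poisoned), a blocked window can overlap only misses within `window` instructions of the blocking miss, and runahead extends that reach by `runahead_distance` instructions. The name `count_phases` and the miss positions are hypothetical.

```python
def count_phases(miss_positions, window, runahead_distance=0):
    """Count memory phases for independent misses at given instruction positions."""
    reach = window + runahead_distance
    misses = sorted(miss_positions)
    phases = 0
    i = 0
    while i < len(misses):
        head = misses[i]                     # miss that blocks the window
        # Every later miss within reach is started (or prefetched) in this phase.
        while i < len(misses) and misses[i] < head + reach:
            i += 1
        phases += 1
    return phases


misses = [0, 6]   # two off-chip misses, six instructions apart (made-up example)
base = count_phases(misses, window=4)
runahead = count_phases(misses, window=4, runahead_distance=8)
print(f"base MLP = {len(misses) / base}, runahead MLP = {len(misses) / runahead}")
```

With a 4-instruction window the second miss is out of reach and the two misses serialize; running ahead past the blocked window lets both overlap, matching the MLP = 1 versus MLP = 2 example on slide 27.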

Slide 29: Continual Flow Pipeline [Srinivasan ASPLOS04]
Simplified Example
Have off-chip access M free many resources, but SAVE
Keep decoding instructions
SAVE instructions dependent on M
Execute instructions independent of M
When M completes, execute SAVED instructions

Slide 30: Implications of Runahead & Continual Flow
Runahead
  – Discards dependent instructions
  – Speculatively executes independent instructions
  – When miss returns, re-executes dependent & independent instrns
Continual Flow Pipeline
  – Saves dependent instructions
  – Executes independent instructions
  – When miss returns, executes only saved dependent instructions
Assessment
  – Both allow MLP to break past window limits
  – Both limited by branch prediction accuracy on unresolved branches
  – Continual Flow Pipeline sounds even more appealing
  – But may not be worthwhile (vs. Runahead) & memory order issues

Slide 31: Outline
Computer Architecture Drivers
Instruction Level Parallelism (ILP) Review
Memory Level Parallelism (MLP)
Improving MLP of Thread
Improving MLP of a Core or Chip
  – Core: Simultaneous Multithreading
  – Chip: Chip Multiprocessing
CMP Systems

Slide 32: Getting MLP from Thread Level Parallelism
Runahead & Continual Flow seek MLP for Thread
More MLP for Processor?
  – More parallel off-chip accesses for a processor?
  – Yes: Simultaneous Multithreading
More MLP for Chip?
  – More parallel off-chip accesses for a chip?
  – Yes: Chip Multiprocessing
Exploit workload Thread Level Parallelism (TLP)

Slide 33: Simultaneous Multithreading [U Washington]
Turn a physical processor into S logical processors
Need S copies of architectural state, S=2, 4, (8?)
  – PC, Registers, PSW, etc. (small!)
Completely share
  – Caches, functional units, & datapaths
Manage via threshold sharing, partition, etc.
  – Physical registers, issue queue, & reorder buffer
Intel calls it Hyperthreading in Pentium 4
  – 1.4x performance for S=2 with little area, but complexity
  – But Pentium 4 is now dead & no Hyperthreading in Banias

Slide 34: Simultaneous Multithreading Assessment
Programming
  – Supports finer-grained sharing than old-style MP
  – But gains less than S and S is small
Have Multi-Threaded Workload
  – Hides off-chip latencies better than Runahead
  – E.g., 4 threads w/ MLP 1.5 each → MLP = 6
Have Single-Threaded Workload
  – Base SMT No Help
  – Many “Helper Thread” Ideas
Expect SMT in processors for servers
Probably SMT even in processors for clients

Slide 35: Want to Spend More Transistors
Not worthwhile to spend it all on cache
Replicate Processor
Private L1 Caches
  – Low latency
  – High bandwidth
Shared L2 Cache
  – Larger than if private

Slide 36: Piranha Processing Node
Alpha core: 1-issue, in-order, 500MHz
Next few slides from Luiz Barroso’s ISCA 2000 presentation of Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing

Slides 37-43: Piranha Processing Node (built up one component per slide)
Alpha core: 1-issue, in-order, 500MHz
L1 caches: I&D, 64KB, 2-way
Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
L2 cache: shared, 1MB, 8-way
Memory Controller (MC): RDRAM, 12.8GB/sec
Protocol Engines (HE & RE): μprog., 1K μinstr., even/odd interleaving
System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth
[Figure: eight CPUs, each with I$ and D$, and eight L2$ banks joined by the ICS; MEM-CTL (8); HE and RE protocol engines; Router (4 links, 8GB/s)]

Slide 44: Simulated Architectures

Slide 45: Single-Chip Piranha Performance
Piranha’s performance margin 3x for OLTP and 2.2x for DSS
Piranha has more outstanding misses → better utilizes memory system

Slide 46: Chip Multiprocessing Assessment: Servers
Programming
  – Supports finer-grained sharing than old-style MP
  – But not as fine as SMT (yet)
  – Many cores can make performance gain large
Can Yield MLP for Chip!
  – Can do CMP of SMT processors
  – C cores of S-way SMT with T-way MLP per thread
  – Yields Chip MLP of C*S*T (e.g., 8*2*2 = 32)
Most Servers have Multi-Threaded Workload
CMP is a Server Inflection Point
  – Expect >10x performance for less cost
  – Implying, >>10x cost-performance
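The C*S*T product is easy to evaluate for other configurations. A minimal sketch, assuming threads never contend for memory bandwidth or shared cache (an idealization the product itself implies); `chip_mlp` is a name invented here:

```python
def chip_mlp(cores: int, smt_ways: int, mlp_per_thread: float) -> float:
    """Idealized chip-level MLP: C cores * S-way SMT * T-way MLP per thread."""
    return cores * smt_ways * mlp_per_thread


# The CMP example from this slide: 8 cores, 2-way SMT, MLP of 2 per thread.
print(chip_mlp(cores=8, smt_ways=2, mlp_per_thread=2))

# The SMT example from slide 34: one core, 4 threads, MLP of 1.5 per thread.
print(chip_mlp(cores=1, smt_ways=4, mlp_per_thread=1.5))
```

The same function covers both the SMT-only case (C = 1) and a CMP built from SMT cores.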

Slide 47: Chip Multiprocessing Assessment: Clients
Most Clients (Today) have Single-Threaded Workload
  – Base CMP No Help
  – Use Thread Level Speculation?
  – Use Helper Threads?
CMPs for Clients?
  – Depends on Threads
  – CMP costs significant chip area (unlike SMT)

Slide 48: Outline
Computer Architecture Drivers
Instruction Level Parallelism (ILP) Review
Memory Level Parallelism (MLP)
Improving MLP of Thread
Improving MLP of a Core or Chip
CMP Systems
  – Small, Medium, but Not Large
  – Wisconsin Multifacet Token Coherence

Slide 49: Small CMP Systems
Use One CMP (with C cores of S-way SMT)
  – C starts at 2-4 and grows to 16-ish
  – S starts at 2, may stay at 2 or grow to 4
  – Fits on your desk!
Directly Connect CMP (C) to Memory Controller (M) or DRAM
If Threads Useful
  – >10X Performance vs. Uniprocessor
  – >>10X Cost-Performance vs. non-CMP SMP
Commodity Server!
[Figure: one CMP (C) connected to a memory controller (M)]

Slide 50: Medium CMP Systems
Use 2-16 CMPs (with C cores of S-way SMT)
  – Small: 4*4*2 = 32
  – Large: 16*16*4 = 1024
Connect CMPs & Memory Controllers (or DRAM)
[Figure: three ways to connect CMPs (C) and memory controllers (M): Processor-Centric, Memory-Centric, Dance Hall]

Slide 51: Large CMP Systems?
1000s of CMPs?
Will not happen in the commercial market
  – Instead will network CMP systems into clusters
  – Enhances availability & reduces cost
  – Poor latency acceptable
Market for large scientific machines probably ~$0 Billion
Market for large government machines similar
  – Nevertheless, government can make this happen (like bombers)
The rest of us will use
  – A small- or medium-CMP system
  – A cluster of small- or medium-CMP systems

Slide 52: Wisconsin Multifacet
Designing Commercial Servers
Availability: SafetyNet Checkpointing [ISCA 2002]
Programmability: Flight Data Recorder [ISCA 2003]
Methods: Simulating a $2M Server on a $2K PC [Computer 2003]
Performance: Cache Compression [ISCA 2004]
Simplicity & Performance: Token Coherence (next)

Slide 53: Token Coherence [IEEE MICRO Top Picks 03]
Coherence Invariant (for any memory block at any time):
  – One writer or multiple readers
Implemented with distributed Finite State Machines
Indirectly enforced (bus order, acks, blocking, etc.)
Token Coherence Directly Enforces
  – Each memory block has T tokens
  – Token count stored with data (even in messages)
  – Processor needs all T tokens to write
  – Processor needs at least one token to read
Last year: Glueless Multiprocessor
  – Speedup 17-54% vs directory
This Year: Medium CMP Systems
  – Flat for correctness
  – Hierarchical for performance
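The token rules above can be sketched as simple bookkeeping for a single memory block. This is an illustration of the counting invariant only, not the Multifacet protocol: it ignores messages, data transfer, and races, and the class and method names (`TokenBlock`, `can_read`, `can_write`, `transfer`) are invented for this example.

```python
class TokenBlock:
    """Token bookkeeping for one memory block (illustrative only)."""

    def __init__(self, total_tokens, holders, initial_holder):
        if initial_holder not in holders:
            raise ValueError("initial_holder must be one of holders")
        self.total = total_tokens
        self.tokens = {h: 0 for h in holders}
        self.tokens[initial_holder] = total_tokens

    def can_read(self, who):
        # A reader needs at least one token.
        return self.tokens[who] >= 1

    def can_write(self, who):
        # A writer needs all T tokens, so no other holder can read.
        return self.tokens[who] == self.total

    def transfer(self, src, dst, count):
        if count < 0 or count > self.tokens[src]:
            raise ValueError("cannot send tokens the sender does not hold")
        self.tokens[src] -= count
        self.tokens[dst] += count
        # Tokens are never created or destroyed.
        assert sum(self.tokens.values()) == self.total


block = TokenBlock(4, ["memory", "P0", "P1"], "memory")
block.transfer("memory", "P0", 1)
block.transfer("memory", "P1", 1)
readers_ok = block.can_read("P0") and block.can_read("P1")      # multiple readers
block.transfer("memory", "P0", 2)
block.transfer("P1", "P0", 1)
writer_ok = block.can_write("P0") and not block.can_read("P1")  # single writer
print(readers_ok, writer_ok)
```

Because the token total is conserved, one holder owning all T tokens rules out any concurrent reader, which is how counting enforces "one writer or multiple readers" directly.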

Slide 54: Conclusions
Must Exploit Memory Level Parallelism!
At Thread: Runahead & Continual Flow Pipeline
At Processor: Simultaneous Multithreading
At Chip: Chip Multiprocessing
Talk to be filed: Google Mark Hill > Publications > 2004