© 2004 Mark D. Hill, Wisconsin Multifacet Project

Future Computer Advances are Between a Rock (Slow Memory) and a Hard Place (Multithreading)

Mark D. Hill
Computer Sciences Dept. and Electrical & Computer Engineering Dept.
University of Wisconsin—Madison
Multifacet Project, October 2004
Full Disclosure: Consults for Sun & US NSF

Executive Summary: Problem
Expect computer performance doubling every 2 years
Derives from Technology & Architecture
Technology will advance for ten or more years
But Architecture faces a Rock: Slow Memory
–a.k.a. the Memory Wall [Wulf & McKee 1995]
Prediction: Popular Moore's Law (doubling performance) will end soon, regardless of the real Moore's Law (doubling transistors)

Executive Summary: Recommendation
Chip Multiprocessing (CMP) Can Help
–Implement multiple processors per chip
–>>10x cost-performance for multithreaded workloads
–What about software with one apparent thread?
Go to Hard Place: Mainstream Multithreading
–Make most workloads flourish with chip multiprocessing
–Computer architects can help, but only in the long run
–Requires moving multithreading from CS fringe to center (algorithms, programming languages, …, hardware)
Necessary For Restoring Popular Moore's Law

Outline
Executive Summary
Background
–Moore's Law
–Architecture
–Instruction Level Parallelism
–Caches
Going Forward: Processor Architecture Hits Rock
Chip Multiprocessing to the Rescue?
Go to the Hard Place of Mainstream Multithreading

Society Expects A Popular Moore's Law
Computing critical: commerce, education, engineering, entertainment, government, medicine, science, …
–Servers (> PCs)
–Clients (= PCs)
–Embedded (< PCs)
Come to expect a misnamed "Moore's Law"
–Computer performance doubles every two years (same cost)
–⇒ Progress in next two years = All past progress
Important Corollary (arithmetic below)
–Computer cost halves every two years (same performance)
–⇒ In ten years, same performance for ~3% of the cost (sales tax – Jim Gray)
Derives from Technology & Architecture
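A quick check of the corollary's arithmetic: ten years at one halving every two years is five halvings, so the same performance costs

\[ \left(\tfrac{1}{2}\right)^{5} = \tfrac{1}{32} \approx 3.1\% \]

of the original price, which matches the "sales tax" figure quoted on the slide.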

(Technologist's) Moore's Law Provides Transistors
Number of transistors per chip doubles every two years (often quoted as every 18 months)
Merely a "Law" of Business Psychology
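Written as a formula (with doubling period \(T\) of roughly 2 years, or 18 months on the more aggressive reading), the transistor count after \(t\) years is

\[ N(t) = N(0)\cdot 2^{\,t/T}. \]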

Performance from Technology & Architecture
[Figure reprinted from Hennessy and Patterson, "Computer Architecture: A Quantitative Approach," 3rd Edition, 2003, Morgan Kaufmann Publishers.]

Architects Use Transistors To Compute Faster
Bit Level Parallelism (BLP) within Instructions
Instruction Level Parallelism (ILP) among Instructions (example below)
Scores of speculative instructions look sequential!
[Figure: Time vs. Instructions timelines illustrating instruction overlap.]
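A minimal C sketch (an illustration, not from the talk; the function names are mine) of what ILP means: the first loop's iterations are independent, so a superscalar, out-of-order core can have many of them in flight at once, while the second loop carries a dependence through `acc` and so exposes little ILP.

```c
#include <stddef.h>

/* Independent iterations: the core can overlap many of these at once. */
void scale(double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * b[i];          /* no iteration depends on another */
}

/* Dependent chain: each add must wait for the previous one (low ILP). */
double reduce(const double *b, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc = acc + b[i];           /* serialized through acc */
    return acc;
}
```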

Architects Use Transistors To Tolerate Slow Memory
Cache
–Small, Fast Memory
–Holds information (expected) to be used soon
–Mostly Successful (example below)
Apply Recursively
–Level-one cache(s)
–Level-two cache
Most of microprocessor die area is cache!
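A small illustrative C sketch (not from the talk) of why caches are "mostly successful": the row-major traversal touches consecutive addresses and hits in cache, while the column-major traversal of the same data strides across rows and misses far more often.

```c
#define N 2048
static double m[N][N];   /* 32 MB: much larger than typical caches */

/* Row-major traversal: consecutive accesses fall in the same cache lines,
 * so most references hit in the L1/L2 caches. */
double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal of the same data: each access jumps N*8 bytes,
 * defeating spatial locality and missing in cache far more often. */
double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```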

Outline
Executive Summary
Background
Going Forward: Processor Architecture Hits Rock
–Technology Continues
–Slow Memory
–Implications
Chip Multiprocessing to the Rescue?
Go to the Hard Place of Mainstream Multithreading

Future Technology Implications
For (at least) ten years, Moore's Law continues
–More repeated doublings of number of transistors per chip
–Faster transistors
But hard for processor architects to use
–More transistors: hard to use due to global wire delays
–Faster transistors: hard to use due to too much dynamic power
Moreover, hitting a Rock: Slow Memory
–Memory access = 100s of floating-point multiplies! (sketch below)
–a.k.a. the Memory Wall [Wulf & McKee 1995]
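A minimal sketch (not from the talk; sizes and constants are assumptions) of the classic pointer-chasing microbenchmark used to expose raw memory latency. Each load's address comes from the previous load, so the hardware cannot overlap them; with a working set far larger than the caches, each step pays roughly one full memory access, the "100s of floating-point multiplies" the slide refers to.

```c
#include <stdio.h>
#include <stdlib.h>

#define N (1u << 26)        /* 2^26 slots * 8 bytes = 512 MB (assumed size) */
#define STRIDE 1000003u     /* odd, hence coprime with N: visits every slot */

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Link the slots into one long cycle, in large-stride order so that
     * successive accesses land far apart and miss in the caches. */
    size_t idx = 0;
    for (size_t k = 1; k < N; k++) {
        size_t nxt = (size_t)(((unsigned long long)k * STRIDE) % N);
        next[idx] = nxt;
        idx = nxt;
    }
    next[idx] = 0;                          /* close the cycle */

    size_t p = 0;
    for (long i = 0; i < 100000000L; i++)   /* time this loop externally;   */
        p = next[p];                        /* time / iterations ~= latency */

    printf("%zu\n", p);                     /* keep the chase live */
    free(next);
    return 0;
}
```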

Rock: Memory Gets (Relatively) Slower
[Figure reprinted from Hennessy and Patterson, "Computer Architecture: A Quantitative Approach," 3rd Edition, 2003, Morgan Kaufmann Publishers.]

Impact of Slow Memory (Rock)
Off-Chip Misses are now hundreds of cycles
[Figure: Time vs. Instructions timelines for a "Good Case!" and a "More Realistic Case", alternating Compute Phases and Memory Phases, with an instruction window of 4 (or 64).]

Implications of Slow Memory (Rock)
Increasing memory latency hides the compute phases
Near Term Implications
–Reduce memory latency
–Fewer memory accesses
–More Memory Level Parallelism (MLP) (see the sketch below)
Longer Term Implications
–What can single-threaded software do while waiting 100 instruction opportunities, 200, 400, … 1000?
–What can amazing speculative hardware do?
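A small illustrative C sketch (not from the talk; function names are mine) of Memory Level Parallelism: in `gather()` the addresses are all known up front, so an out-of-order core, helped here by the GCC/Clang `__builtin_prefetch` hint, can keep several cache misses outstanding at once; in `chase()` each address depends on the previous load, so misses are serialized.

```c
#include <stddef.h>

/* Independent misses: many can be outstanding at once (high MLP). */
double gather(const double *a, const int *idx, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[idx[i + 8]]);  /* hint a future miss now */
        s += a[idx[i]];
    }
    return s;
}

/* Dependent misses: each load waits for the previous one (no MLP). */
size_t chase(const size_t *next, size_t start, size_t steps) {
    size_t p = start;
    for (size_t i = 0; i < steps; i++)
        p = next[p];
    return p;
}
```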

Assessment So Far
Appears
–Popular Moore's Law (doubling performance) will end soon, regardless of the real Moore's Law (doubling transistors)
–Processor performance is hitting the Rock (Slow Memory)
No known way to overcome this, unless we redefine performance in the Popular Moore's Law
–From Processor Performance
–To Chip Performance

Outline
Executive Summary
Background
Going Forward: Processor Architecture Hits Rock
Chip Multiprocessing to the Rescue?
–Small & Large CMPs
–CMP Systems
–CMP Workload
Go to the Hard Place of Mainstream Multithreading

Performance for Chip, not Processor or Thread
Chip Multiprocessing (CMP)
Replicate Processor
Private L1 Caches
–Low latency
–High bandwidth
Shared L2 Cache
–Larger than if private

Piranha Processing Node
From Luiz Barroso's ISCA 2000 presentation of Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
–Alpha CPU cores: 1-issue, in-order, 500MHz (8 per chip)
–L1 caches: I&D, 64KB, 2-way
–Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
–L2 cache: shared, 1MB, 8-way
–Memory Controller (MC): RDRAM, 12.8GB/sec
–Protocol Engines (HE & RE): µprog., 1K µinstr., even/odd interleaving
–System Interconnect: 4-port crossbar router, topology independent, 32GB/sec total bandwidth (4 × 8GB/s)

Single-Chip Piranha Performance
Piranha's performance margin: 3x for OLTP and 2.2x for DSS
Piranha has more outstanding misses → better utilizes the memory system

Simultaneous Multithreading (SMT)
Multiplex S logical processors on each processor
–Replicate registers, share caches, & manage other parts
–Implementation factors keep S small, e.g., 2-4
Cost-effective gain if threads available (rough arithmetic below)
–E.g., S=2 → 1.4x performance
Modest cost
–Limits waste if additional logical processor(s) not used
Worthwhile CMP enhancement
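Why the modest gain is still cost-effective (a rough sketch; the ~5% area figure is an assumption, not from the talk): if a second SMT context adds about 5% to the core and delivers the slide's 1.4x throughput, cost-performance improves by roughly

\[ \frac{1.4}{1.05} \approx 1.33, \]

i.e., about a third better throughput per unit of silicon.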

Small CMP Systems
Use One CMP (with C cores of S-way SMT)
–C=[2,16] & S=[2,4] ⇒ C*S = [4,64] thread contexts
–Size of a small PC!
Directly Connect CMP (C) to Memory Controller (M) or DRAM

Medium CMP Systems
Use 2-16 CMPs (with C cores of S-way SMT)
–Smaller: 2*4*4 = 32 thread contexts
–Larger: 16*16*4 = 1024 thread contexts
–In a single cabinet
Connecting CMPs & Memory Controllers/DRAM raises many issues
–[Diagram: "Processor-Centric" vs. "Dance Hall" arrangements of CMPs (C) and memory (M)]

Inflection Points
An inflection point occurs when
–A smooth input change leads to a disruptive output change
Enough transistors for …
–1970s: simple microprocessor
–1980s: pipelined RISC
–1990s: speculative out-of-order
–2000s: …
CMP will be the Server Inflection Point
–Expect >10x performance for less cost
–Implying >>10x cost-performance
–Early CMPs look like old SMPs, but expect dramatic advances!

So What's Wrong with the CMP Picture?
Chip Multiprocessors
–Allow profitable use of more transistors
–Support modest to vast multithreading
–Will be the inflection point for commercial servers
But
–Many workloads have only a single thread (available to run)
–Even if that single thread solves a problem formerly done by many people in parallel (e.g., clerks in payroll processing)
Go to a Hard Place
–Make most workloads flourish with CMPs

Outline
Executive Summary
Background
Going Forward: Processor Architecture Hits Rock
Chip Multiprocessing to the Rescue?
Go to the Hard Place of Mainstream Multithreading
–Parallel from Fringe to Center
–For All of Computer Science!

Thread Parallelism from Fringe to Center
History
–Automatic Computer (vs. Human) → Computer
–Digital Computer (vs. Analog) → Computer
Must Change
–Parallel Computer (vs. Sequential) → Computer
–Parallel Algorithm (vs. Sequential) → Algorithm
–Parallel Programming (vs. Sequential) → Programming
–Parallel Library (vs. Sequential) → Library
–Parallel X (vs. Sequential) → X
Otherwise, repeated performance doublings are unlikely

Computer Architects Can Contribute
Chip Multiprocessor Design
–Transcend pre-CMP multiprocessor design
–Intra-CMP has lower latency & much higher bandwidth
Hide Multithreading (Helper Threads)
Assist Multithreading (Thread-Level Speculation)
Ease Multithreaded Programming (Transactions) (sketch below)
Provide a "Gentle Ramp to Parallelism" (Hennessy)
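A rough C sketch of what "Transactions" buys the programmer (this uses GCC's experimental transactional-memory extension, enabled with `-fgnu-tm`, purely as a stand-in; the talk does not prescribe any particular interface): with locks the programmer must choose the locks and their acquisition order, while with an atomic region the runtime or hardware detects conflicts.

```c
#include <pthread.h>

/* Lock-based version: the programmer picks locks and must acquire them in a
 * consistent order (a before b) to avoid deadlock. */
pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
long balance_a = 100, balance_b = 0;

void transfer_locked(long amount) {
    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    balance_a -= amount;
    balance_b += amount;
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
}

/* Transactional version: the programmer states what must be atomic; the TM
 * runtime/hardware detects conflicts. (Sketch only; needs a TM-capable GCC.) */
void transfer_transactional(long amount) {
    __transaction_atomic {
        balance_a -= amount;
        balance_b += amount;
    }
}
```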

But All of Computer Science is Needed
Hide Multithreading (Libraries & Compilers) (example below)
Assist Multithreading (Development Environments)
Ease Multithreaded Programming (Languages)
Divide & Conquer Multithreaded Complexity (Theory & Abstractions)
Must Enable
–99% of programmers to think sequentially while
–99% of instructions execute in parallel
Enable a "Parallelism Superhighway"
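One existing, if limited, example of the "Libraries & Compilers" route (an illustration of the idea, not something the talk endorses): with OpenMP the programmer writes an ordinary sequential-looking loop, and a single directive lets the compiler and runtime spread its independent iterations across the chip's hardware threads.

```c
#include <stdio.h>
#include <omp.h>

/* Sequential-looking dot product; the pragma asks the compiler/runtime to
 * split the loop's independent iterations across all available threads.
 * Compile with: gcc -fopenmp dot.c */
int main(void) {
    enum { N = 1 << 20 };
    static double a[N], b[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    double dot = 0.0;
    #pragma omp parallel for reduction(+:dot)
    for (int i = 0; i < N; i++)
        dot += a[i] * b[i];

    printf("dot = %f, threads available = %d\n", dot, omp_get_max_threads());
    return 0;
}
```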

Summary
(Single-Threaded) Computing faces a Rock: Slow Memory
Popular Moore's Law (doubling performance) will end soon
Chip Multiprocessing Can Help
–>>10x cost-performance for multithreaded workloads
–What about software with one apparent thread?
Go to Hard Place: Mainstream Multithreading
–Make most workloads flourish with chip multiprocessing
–Computer architects can help, but only in the long run
–Requires moving multithreading from CS fringe to center
Necessary For Restoring Popular Moore's Law