Single-Chip Multiprocessors: the Rebirth of Parallel Architecture
Guri Sohi, University of Wisconsin

Slide 2: Outline
- Waves of innovation in architecture
- Innovation in uniprocessors
- Lessons from uniprocessors
- Future chip multiprocessor architectures
- Software and such things

Slide 3: Waves of Research and Innovation
- A new direction is proposed or a new opportunity becomes available
- The center of gravity of the research community shifts to that direction:
  - SIMD architectures in the 1960s
  - HLL computer architectures in the 1970s
  - RISC architectures in the early 1980s
  - Shared-memory MPs in the late 1980s
  - OOO speculative execution processors in the 1990s

Slide 4: Waves
- A wave is especially strong when coupled with a "step function" change in technology:
  - Integration of a processor on a single chip
  - Integration of a multiprocessor on a chip

Slide 5: Uniprocessor Innovation Wave: Part 1
- Many years of multi-chip implementations
  - Generally microprogrammed control
  - Major research topics: microprogramming, pipelining, quantitative measures
- Significant research in multiprocessors

Slide 6: Uniprocessor Innovation Wave: Part 2
- Integration of the processor on a single chip
  - The inflection point
  - Argued for a different architecture (RISC)
- More transistors allow for different models
  - Speculative execution
- Then the rebirth of uniprocessors
  - Continue the journey of innovation
  - Totally rethink uniprocessor microarchitecture
  - Jim Keller: "Golden Age of Microarchitecture"

Slide 7: Uniprocessor Innovation Wave: Results
- Current uniprocessors very different from 1980s uniprocessors
- Uniprocessor research dominates conferences
- MICRO comes back from the dead
  - Top 1% (NEC CiteSeer)
- Impact on compilers
(Source: Rajwar and Hill, 2001)

Slide 8: Why a Uniprocessor Innovation Wave?
- Innovation needed to happen
  - Alternatives (multiprocessors) were not a practical option for using additional transistors
- Innovation could happen: things could be done differently
  - Identify barriers (e.g., to performance)
  - Use transistors to overcome barriers (e.g., via speculation)
  - Simulation tools facilitate innovation

Slide 9: Lessons from Uniprocessors
- Don't underestimate what can be done in hardware
  - Doing things in software was considered easy; in hardware, considered hard
  - Now perhaps the opposite
- Barriers or limits become opportunities for innovation
  - Via novel forms of speculation
  - E.g., the barriers in Wall's study on limits of ILP

Slide 10: Multiprocessor Architecture
- A.k.a. the "multiarchitecture" of a multiprocessor
- Take a state-of-the-art uniprocessor
- Connect several together with a suitable network
  - Have to live with defined interfaces
- Expend hardware to provide cache coherence and streamline inter-node communication
  - Have to live with defined interfaces

Slide 11: Software Responsibilities
- Have software figure out how to use the MP
- Reason about parallelism
- Reason about execution times and overheads
- Orchestrate parallel execution
- Very difficult for software to parallelize transparently

Slide 12: Explicit Parallel Programming
- Have the programmer express parallelism
- Reasoning about parallelism is hard
- Use synchronization to ease reasoning
  - Parallel trends toward serial with the use of synchronization

Slide 13: Net Result
- Difficult to get parallelism speedup
  - Computation is serial
  - Inter-node communication latencies exacerbate the problem
- Multiprocessors rarely used for parallel execution
- Used to run threaded programs
  - Lower-overhead synchronization would help
- Used to improve throughput

Slide 14: The Inflection Point for Multiprocessors
- Can put a basic small-scale MP on a chip
- Can think of alternative ways of building the multiarchitecture
- Don't have to work with defined interfaces!
- What opportunities does this open up?
  - Allows parallelism to be used for performance
  - Allows novel techniques to overcome (software and hardware) barriers
  - Other opportunities (e.g., reliability)

Slide 15: Parallel Software
- There needs to be a compelling reason to have a parallel application
- Won't happen if the application is difficult to create
  - Whether written by the programmer or automatically parallelized by a compiler
- Won't happen if the performance gain is insufficient

Slide 16: Changes in MP Multiarchitecture
- Inventing new functionality to overcome barriers
  - Consider barriers as opportunities
- Developing new models for using CMPs
- Revisiting the traditional use of MPs

Slide 17: Speculative Multithreading
- Speculatively parallelize an application
  - Use speculation to overcome ambiguous dependences
  - Use hardware support to recover from mis-speculation
  - E.g., multiscalar
- Use hardware to overcome limitations
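The idea on this slide can be illustrated with a toy software sketch (the names and the sequential "commit in program order" model are illustrative assumptions, not the multiscalar hardware mechanism): tasks run optimistically against a snapshot of memory, and a task whose speculative read turns out to conflict with an earlier task's write is squashed and re-executed.

```python
# Toy sketch of speculative parallelization: tasks execute optimistically
# against a snapshot; at commit time (in program order), a task that read
# a location an earlier task wrote is squashed and re-executed with
# up-to-date values. All names here are hypothetical.
def speculative_execute(tasks, memory):
    """tasks: list of functions taking (read, write) callbacks."""
    snapshot = dict(memory)
    logs = []
    for task in tasks:  # models concurrent optimistic runs on one snapshot
        reads, writes = set(), {}
        task(lambda k: (reads.add(k), snapshot[k])[1],
             lambda k, v: writes.__setitem__(k, v))
        logs.append((reads, writes))
    committed = set()
    for i, (reads, writes) in enumerate(logs):
        if reads & committed:
            # mis-speculation detected: re-execute against current memory
            reads, writes = set(), {}
            tasks[i](lambda k: (reads.add(k), memory[k])[1],
                     lambda k, v: writes.__setitem__(k, v))
        memory.update(writes)
        committed |= writes.keys()
    return memory
```

Here the "ambiguous dependence" is any read that might alias an earlier task's write; speculation lets independent tasks overlap, and the recovery path handles the case where the dependence was real.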

Slide 18: Overcoming Barriers: Memory Models
- Weak models were proposed to overcome the performance limitations of SC
- Speculation used to overcome "maybe" dependences
- A series of papers shows SC can achieve the performance of weak models

Slide 19: Implications
- Strong memory models are not necessarily low performance
- The programmer does not have to reason about weak models
- More likely to have parallel programs written

Slide 20: Overcoming Barriers: Synchronization
- Synchronization used to avoid "maybe" dependences
  - Causes serialization
- Speculate to overcome serialization
- Recent work on techniques to dynamically elide synchronization constructs
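A software analogue of this elision idea is optimistic concurrency with a version check: run the critical section's work without holding the lock, and only fall back to the pessimistic locked path when a conflict is detected. This is a hedged sketch of the concept, not the hardware elision mechanism the slide refers to; the class and method names are invented for illustration.

```python
import threading

# Sketch of optimistically "eliding" a critical section: compute outside
# the lock against a versioned snapshot; commit atomically only if the
# shared state did not change underneath us, else retry under the lock.
class ElidableCounter:
    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self.value = 0

    def update(self, fn):
        # optimistic attempt: snapshot, compute with no lock held
        seen_version, seen_value = self._version, self.value
        new_value = fn(seen_value)
        with self._lock:  # the commit point itself must be atomic
            if self._version == seen_version:
                self.value = new_value
                self._version += 1
                return "elided-path"
        # conflict detected: retry pessimistically under the lock
        with self._lock:
            self.value = fn(self.value)
            self._version += 1
            return "lock-path"
```

When threads rarely conflict, almost all updates take the optimistic path, which mirrors the slide's point: liberal synchronization need not mean serialized execution.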

Slide 21: Implications
- The programmer can make liberal use of synchronization to ease programming
- Little performance impact from synchronization
- More likely to have parallel programs written

Slide 22: Overcoming Barriers: Coherence
- Caches are used to reduce data access latency; they need to be kept coherent
- Latencies of getting a value from a remote location impact performance
- Getting a remote value is a two-part operation:
  - Get the value
  - Get permission to use the value
- Can separating these help?

Slide 23: Coherence Decoupling (timeline figure)
- Sequential execution: coherence miss detected; permission and value arrive together (full miss latency)
- Speculative execution: coherence miss detected; value predicted immediately; permission granted later; value arrives and is verified (best-case latency), or the access is retried on mis-speculation (worst-case latency)
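The decoupled timeline can be sketched in a few lines (a conceptual model only; the function names and the zero-value predictor are assumptions for illustration): execution runs ahead on a predicted value while the real value and permission are in flight, then either commits or squashes and retries.

```python
# Conceptual sketch of coherence decoupling: on a coherence miss, predict
# the value and compute speculatively; when the real value and permission
# arrive, commit the speculative work if the prediction matched, otherwise
# squash and re-execute with the actual value.
def load_with_decoupling(predict, fetch_remote, compute):
    predicted = predict()                    # value predicted at miss time
    speculative_result = compute(predicted)  # run ahead off the prediction
    actual = fetch_remote()                  # value + permission arrive later
    if actual == predicted:
        return speculative_result, "hit-best-case"
    return compute(actual), "squashed-and-retried"
```

In the best case the compute latency is fully overlapped with the remote fetch; in the worst case the cost is the original miss latency plus the squash, which is the trade-off the timeline on this slide depicts.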

Slide 24: Zeros/Ones in Coherence Misses
(Chart: preliminary data; Simics, UltraSPARC/Solaris, 16 processors; 4MB 4-way set-associative L2 cache, 64B lines, MOSI)

Slide 25: Other Performance Optimizations
- Clever techniques for inter-processor communication
  - Remember: no artificial constraints on chip
- Further reduction of artificial serialization

Slide 26: New Uses for CMPs
- Helper threads, redundant execution, etc. will need extensive research in the context of CMPs
- How about trying to parallelize the application, i.e., the "traditional" use of MPs?

Slide 27: Revisiting the Traditional Use of MPs
- Compilers and software for MPs
- Digression: Master/Slave Speculative Parallelization (MSSP)
- Expectations for future software
- Implications

Slide 28: Parallelizing Apps: A Moving Target
- We learned to reason about certain languages, data structures, programming constructs, and applications
- Newer languages, data structures, programming constructs, and applications keep appearing
- Always playing catch-up
- Can we get a jump ahead?

Slide 29: Master/Slave Speculative Parallelization (MSSP)
- Take a program and create a program with two sub-programs: a master and a slave
- The master program is an approximate (or distilled) version of the original program
- The slave (original) program "checks" the work done by the master
- Portions of the slave program execute in parallel

Slide 30: MSSP Overview (figure)
- Original code: segments A, B, C, D, E; distilled code: A_D, B_D, C_D, D_D, E_D
- The distilled code runs on the master; the original code runs concurrently on the slaves and verifies the distilled code
- Checkpoints are used to communicate changes

Slides 31-34: MSSP Distiller (animated figure)
- A program with many paths; the dominant paths are progressively identified

Slide 35: MSSP Distiller
- Approximate the important paths; apply compiler optimizations to produce the distilled code

Slide 36: MSSP Execution (figure)
- The master runs A_D, B_D, C_D, D_D, E_D; the slaves run A, B, C, D, E
- The checkpoints of A_D, B_D, and C_D verify as correct; the checkpoint of D_D is wrong, so E and E_D are restarted

Slide 37: MSSP Summary
- Distill away code that is unlikely to impact state used by later portions of the program
- Performance tracks the distillation ratio
  - Better distillation = better performance
- Verification of the distilled program is done in parallel on the slaves
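The MSSP execution model from slides 29-37 can be condensed into a toy sketch (a simplification under assumed names, not the actual MSSP implementation): the master's distilled pass emits a checkpoint after each segment, slave re-executions of the original segments start from those checkpoints independently, and execution falls back to the first verified state when a checkpoint is wrong.

```python
# Toy MSSP sketch: the distilled master emits predicted checkpoints; each
# slave re-runs the ORIGINAL segment i from checkpoint i-1 (these runs are
# independent, hence parallelizable) and verifies checkpoint i. On the
# first wrong checkpoint, restart from the verified state.
def mssp_run(segments, distilled, initial_state):
    # master pass: fast, approximate predictions of each segment's output
    checkpoints, s = [], initial_state
    for i in range(len(segments)):
        s = distilled(i, s)
        checkpoints.append(s)
    # slave pass: original code, one independent run per segment
    starts = [initial_state] + checkpoints[:-1]
    results = [segments[i](starts[i]) for i in range(len(segments))]
    for i, r in enumerate(results):
        if r != checkpoints[i]:
            # mis-speculation: r is the correct state here; later slave
            # runs started from bad checkpoints, so re-run sequentially
            for later in segments[i + 1:]:
                r = later(r)
            return r
    return results[-1]
```

This mirrors the slide's point that performance tracks the distillation ratio: the better the distilled predictions, the fewer sequential fallbacks the slaves take.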

Slide 38: Future Applications and Software
- What will future applications look like? Don't know
- What language will they be written in? Don't know; don't care
- Code for future applications will have "overheads"
  - Overheads for checking correctness
  - Overheads for improving reliability
  - Overheads for checking security

Slide 39: Overheads as an Opportunity
- The performance cost of overheads has limited their use
- Overheads are not a limit; rather, an opportunity
- Run overhead code in parallel with non-overhead code
  - Develop programming models
  - Develop parallel execution models (a la MSSP)
  - Recent work in this direction
- Success at reducing overhead cost will encourage even more use of "overhead" techniques
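The "run overhead code in parallel" idea can be sketched with ordinary threads (an illustrative software sketch under assumed names, not a specific proposed execution model): the main computation streams its results to a checker thread, so correctness checking overlaps with useful work instead of serializing it.

```python
import threading
import queue

# Sketch of overlapping "overhead" (checking) code with the main
# computation: results are handed to a checker thread through a queue,
# so checks run concurrently with the work that produces them.
def compute_with_parallel_checks(items, work, check):
    results, errors = [], []
    q = queue.Queue()

    def checker():
        while True:
            item = q.get()
            if item is None:      # sentinel: no more results to check
                return
            x, y = item
            if not check(x, y):
                errors.append((x, y))

    t = threading.Thread(target=checker)
    t.start()
    for x in items:
        y = work(x)               # main computation proceeds at full speed
        results.append(y)
        q.put((x, y))             # checking happens on the other thread
    q.put(None)
    t.join()                      # wait for all checks before reporting
    return results, errors
```

On a CMP the checker would occupy an otherwise idle core, which is the slide's argument: once the overhead runs in parallel, its cost stops being a reason to leave checks out.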

Slide 40: Summary
- New opportunities for innovation in MPs
- Expect little resemblance between MPs today and CMPs in 15 years
  - We need to invent and define the differences
  - Not because uniprocessors are running out of steam
  - But because innovation in the CMP multiarchitecture is possible

Slide 41: Summary
- Novel techniques for attacking performance limitations
- New models for expressing work (computation and overhead)
- New parallel processing models
- Simulation tools