Single-Chip Multiprocessors: Redefining the Microarchitecture of Multiprocessors Guri Sohi University of Wisconsin.

2 Outline Waves of innovation in architecture Innovation in uniprocessors Opportunities for innovation Example CMP innovations  Parallel processing models  CMP memory hierarchies

3 Waves of Research and Innovation A new direction is proposed or new opportunity becomes available The center of gravity of the research community shifts to that direction  SIMD architectures in the 1960s  HLL computer architectures in the 1970s  RISC architectures in the early 1980s  Shared-memory MPs in the late 1980s  OOO speculative execution processors in the 1990s

4 Waves Wave is especially strong when coupled with a “step function” change in technology  Integration of a processor on a single chip  Integration of a multiprocessor on a chip

5 Uniprocessor Innovation Wave Integration of processor on a single chip  The inflection point  Argued for different architecture (RISC) More transistors allow for different models  Speculative execution Then the rebirth of uniprocessors  Continue the journey of innovation  Totally rethink uniprocessor microarchitecture

6 The Next Wave Can integrate a simple multiprocessor on a chip  Basic microarchitecture similar to traditional MP Rethink the microarchitecture and usage of chip multiprocessors

7 Broad Areas for Innovation Overcoming traditional barriers New opportunities to use CMPs for parallel/multithreaded execution Novel use of on-chip resources (e.g., on-chip memory hierarchies and interconnect)

8 Remainder of Talk Roadmap Summary of traditional parallel processing Revisiting traditional barriers and overheads to parallel processing Novel CMP applications and workloads Novel CMP microarchitectures

9 Multiprocessor Architecture Take state-of-the-art uniprocessor Connect several together with a suitable network  Have to live with defined interfaces Expend hardware to provide cache coherence and streamline inter-node communication  Have to live with defined interfaces

10 Software Responsibilities Reason about parallelism, execution times and overheads  This is hard Use synchronization to ease reasoning  Parallel execution trends toward serial as synchronization is added Very difficult to parallelize transparently

11 Net Result Difficult to get parallelism speedup  Computation is serial  Inter-node communication latencies exacerbate problem Multiprocessors rarely used for parallel execution Typical use: improve throughput This will have to change  Will need to rethink “parallelization”

12 Rethinking Parallelization Speculative multithreading Speculation to overcome other performance barriers Revisiting computation models for parallelization New parallelization opportunities  New types of workloads (e.g., multimedia)  New variants of parallelization models

13 Speculative Multithreading Speculatively parallelize an application  Use speculation to overcome ambiguous dependences  Use hardware support to recover from misspeculation E.g., multiscalar Use hardware to overcome barriers
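The speculative-multithreading idea above can be sketched in a few lines. This is a minimal toy model, not the multiscalar design: tasks run optimistically against a snapshot of memory, and a task whose read set overlaps an earlier task's committed writes is squashed and re-executed in program order. The `run_speculative` name and task representation are invented for illustration.

```python
def run_speculative(tasks, memory):
    """tasks: list of (read_set, fn); fn(memory_view) -> dict of writes."""
    # Phase 1: run every task against a snapshot (optimistic parallelism).
    snapshot = dict(memory)
    results = [fn(snapshot) for reads, fn in tasks]
    # Phase 2: commit in program order, squashing misspeculated tasks.
    committed_writes = set()
    for i, (reads, fn) in enumerate(tasks):
        if reads & committed_writes:      # ambiguous dependence was real
            results[i] = fn(memory)       # squash: re-execute with current state
        memory.update(results[i])
        committed_writes.update(results[i])
    return memory

mem = {"x": 1, "y": 0}
tasks = [
    (set(), lambda m: {"x": 10}),          # task 0 writes x
    ({"x"}, lambda m: {"y": m["x"] + 1}),  # task 1 reads x: true dependence
]
run_speculative(tasks, mem)                # task 1 is squashed, then re-run
```

After the run, task 1 sees task 0's write, so `mem` ends up as `{"x": 10, "y": 11}`; hardware support plays the role of the snapshot and the commit-order check here.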

14 Overcoming Barriers: Memory Models Weak models proposed to overcome performance limitations of SC Speculation used to overcome “maybe” dependences Series of papers showing SC can achieve performance of weak models

15 Implications Strong memory models not necessarily low performance Programmer does not have to reason about weak models More likely to have parallel programs written

16 Overcoming Barriers: Synchronization Synchronization to avoid “maybe” dependences  Causes serialization Speculate to overcome serialization Recent work on techniques to dynamically elide synchronization constructs
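The elision idea can be illustrated with a toy, single-threaded sketch: the critical section runs without taking the lock, buffering its writes privately, and commits only if the versions of the locations it read did not change; otherwise it falls back to non-speculative re-execution. The `ElidedLock` class and version-tagged memory are invented stand-ins for the hardware mechanism.

```python
class ElidedLock:
    """Toy speculative lock elision; memory maps addr -> (value, version)."""
    def __init__(self, memory):
        self.memory = memory

    def critical_section(self, body):
        read_versions = {}
        write_buf = {}                 # speculative writes stay private
        def load(addr):
            val, ver = self.memory[addr]
            read_versions.setdefault(addr, ver)
            return write_buf.get(addr, val)
        def store(addr, val):
            write_buf[addr] = val
        body(load, store)
        # Validate: if any read location changed, redo non-speculatively
        # (standing in for "acquire the lock and retry").
        if any(self.memory[a][1] != v for a, v in read_versions.items()):
            write_buf.clear()
            body(lambda a: self.memory[a][0], store)
        for addr, val in write_buf.items():   # commit atomically
            _, ver = self.memory.get(addr, (None, 0))
            self.memory[addr] = (val, ver + 1)

counter = {"n": (0, 0)}
lock = ElidedLock(counter)
lock.critical_section(lambda load, store: store("n", load("n") + 1))
```

The counter ends at `(1, 1)`: the increment committed without serializing on a lock, which is the point of the slide; in hardware the buffering and validation are done by the cache and speculation machinery.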

17 Implications Programmer can make liberal use of synchronization to ease programming Little performance impact of synchronization More likely to have parallel programs written

18 Revisiting Parallelization Models Transactions  simplify writing of parallel code  very high overhead to implement semantics in software Hardware support for transactions will exist  Speculative multithreading is ordered transactions  No software overhead to implement semantics More applications likely to be written with transactions
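The transactional model above can be sketched with a tiny all-or-nothing software transaction; the high overhead the slide mentions comes precisely from doing this kind of bookkeeping in software. `run_atomic` and `Abort` are invented names for illustration.

```python
class Abort(Exception):
    pass

def run_atomic(store, body):
    """Run body against a private copy of store; commit all writes or none."""
    working = dict(store)
    try:
        body(working)
    except Abort:
        return False                  # aborted: store is untouched
    store.clear()
    store.update(working)             # atomic from the program's point of view
    return True

accounts = {"a": 5, "b": 0}

def transfer(amount):
    def body(state):
        if state["a"] < amount:
            raise Abort()             # no partial update ever becomes visible
        state["a"] -= amount
        state["b"] += amount
    return body

ok1 = run_atomic(accounts, transfer(10))  # insufficient funds: aborts
ok2 = run_atomic(accounts, transfer(3))   # commits
```

The first transfer aborts leaving `accounts` unchanged; the second commits, leaving `{"a": 2, "b": 3}`. Hardware transaction support would provide the same semantics without the copy-and-swap overhead.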

19 New Opportunities for CMPs New opportunities for parallelism  Emergence of multimedia workloads Amenable to traditional parallel processing  Parallel execution of overhead code  Program demultiplexing  Separating user and OS

20 New Opportunities for CMPs New opportunities for microarchitecture innovation  Instruction memory hierarchies  Data memory hierarchies

21 Reliable Systems Software is unreliable and error-prone Hardware will be unreliable and error-prone Improving hardware/software reliability will result in significant software redundancy  Redundancy will be source of parallelism

22 Software Reliability & Security Reliability & security via dynamic monitoring  Many academic proposals for C/C++ code (CCured, Cyclone, SafeC, etc.)  VM performs checks for Java/C# High overheads!  Encourages use of unsafe code

23 Software Reliability & Security  Divide program into tasks  Fork a monitor thread to check computation of each task  Instrument monitor thread with safety checking code [Figure: program tasks A, B, C, D with monitoring-code tasks A’, B’, C’ running alongside]

24 Software Reliability & Security  Commit/abort at task granularity  Precise error detection achieved by re-executing code with in-lined checks [Figure: tasks A and B commit once monitors A’ and B’ agree; C’ aborts, and C re-executes with in-lined checks before D]

25 Software Reliability & Security Fine-grained instrumentation  Flexible, arbitrary safety checking Coarse-grained verification  Amortizes thread startup latency over many instructions Initial results: software-based fault isolation (Wahbe et al., SOSP 1993)  Assumes no misspeculation due to traps, interrupts, speculative storage overflow
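Slides 23-25 can be condensed into a small sketch: each task runs at full speed on one core while a monitor thread checks its result on another, and the task commits only if its monitor agrees. The `run_with_monitor` name and the task/check representation are invented; a real system would re-execute an aborted task with in-lined checks rather than just marking it.

```python
from concurrent.futures import ThreadPoolExecutor

def run_with_monitor(tasks, check):
    """tasks: zero-arg callables; check(result) -> bool (safety verdict)."""
    committed = []
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(t) for t in tasks]        # tasks run ahead
        for f in futures:
            result = f.result()
            verdict = pool.submit(check, result)         # monitor thread checks
            # Commit at task granularity; None marks an aborted task that
            # would be re-executed with in-lined checks.
            committed.append(result if verdict.result() else None)
    return committed

outcomes = run_with_monitor([lambda: 4, lambda: -1], lambda r: r >= 0)
```

Here the first task commits and the second aborts (`outcomes == [4, None]`), mirroring the commit/abort picture on slide 24.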

26 Software Reliability & Security

27 Program De-multiplexing New opportunities for parallelism  2-4X parallelism Program is a multiplexing of methods (or functions) onto single control flow De-multiplex methods of program Execute methods in parallel in dataflow fashion

28 Data-flow Execution Data-flow machines  Dataflow in programs  No control flow, no PC  Nodes are instructions on FUs OoO superscalar machines  Sequential programs  Limited dataflow with ROB  Nodes are instructions on FUs

29 Program Demultiplexing Demux’ed execution  Nodes are methods (M)  Processors act as FUs [Figure: a sequential program demultiplexed into methods executing in dataflow fashion across processors]

30 Program Demultiplexing Triggers  Usually fire after data dependences of M  Chosen with software support  Handlers generate parameters PD  Data-flow based speculative execution  Differs from control-flow speculation [Figure: method 6 of the sequential program is executed on its trigger, and its demux’ed result is used at the call site]

31 Data-flow Execution of Methods [Figure: timeline comparing the earliest possible execution time of M with its execution time plus overheads]
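The demultiplexing idea of slides 27-31 can be sketched with futures: methods start as soon as their data dependences resolve (the "trigger"), and the sequential control flow simply consumes precomputed results at the call sites. The `demux_run` helper is invented for illustration and elides triggers, parameter-generating handlers, and misspeculation recovery.

```python
from concurrent.futures import ThreadPoolExecutor

def demux_run(pool, fn, *arg_futures):
    """Launch fn once its producing futures (its data deps) are available."""
    return pool.submit(lambda: fn(*[f.result() for f in arg_futures]))

with ThreadPoolExecutor() as pool:
    a = pool.submit(lambda: 2)                     # independent methods start at once
    b = pool.submit(lambda: 3)
    c = demux_run(pool, lambda x, y: x * y, a, b)  # fires when a and b complete
    result = c.result()                            # "on call - use": value is ready
```

The multiply "method" here executes in dataflow fashion rather than at its position in the sequential control flow, which is the 2-4X parallelism opportunity the earlier slide refers to.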

32 Impact of Parallelization Expect different characteristics for code on each core More reliance on inter-core parallelism Less reliance on intra-core parallelism May have specialized cores

33 Microarchitectural Implications Processor Cores  Skinny, less complex Will SMT go away?  Perhaps specialized Memory structures (i-caches, TLBs, d-caches)  Different organizations possible

34 Microarchitectural Implications Novel memory hierarchy opportunities  Use on-chip memory hierarchy to improve latency and attenuate off-chip bandwidth Pressure on non-core techniques to tolerate longer latencies Helper threads, pre-execution
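The helper-thread/pre-execution point can be illustrated with a toy sketch: a helper thread runs a distilled slice of the upcoming computation so that the main thread's long-latency accesses hit in a warmed cache. The software `cache` dict and `slow_load` stand in for the on-chip hierarchy and a miss; all names are invented.

```python
import threading

def slow_load(addr, cache):
    if addr not in cache:
        cache[addr] = addr * addr     # stands in for a long-latency miss
    return cache[addr]

def helper(addrs, cache):
    for a in addrs:                   # distilled slice: only the address stream
        slow_load(a, cache)           # pre-execute to warm the cache

cache = {}
addrs = list(range(8))
t = threading.Thread(target=helper, args=(addrs, cache))
t.start()
t.join()                              # in practice the helper runs ahead, overlapped
total = sum(slow_load(a, cache) for a in addrs)   # main thread: all hits
```

By the time the main computation runs, every load is a hit; the latency-tolerance benefit comes from overlapping the helper with earlier work, which this deterministic sketch does not model.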

35 Instruction Memory Hierarchy Computation spreading

36 Conventional CMP Organizations [Figure: cores each with private L1I/L1D and L2 caches, connected by an interconnect network]

37 Code Commonality A large fraction of the code executed is common to all the processors, both at 8KB page (left) and 64-byte cache block (right) granularity Poor utilization of the L1 I-cache due to replication

38 Removing Duplication Avoiding duplication significantly reduces misses Shared L1 I-cache may be impractical [Figure: MPKI with 64K caches]

39 Computation Spreading Avoid reference spreading by distributing the computation based on code regions  Each processor is responsible for a particular region of code (for multiple threads)  References to a particular code region are localized  Computation from one thread is carried out in different processors

40 Computation Spreading Example [Figure: threads T1-T3 each execute code regions A, B, C over time; in the canonical model each processor P1-P3 runs all regions of its own thread (A B’ C’’, B C’ A’’, C A’ B’’), while with computation spreading each processor runs a single region on behalf of all threads (A B’ C’’, A’ B C’, A’’ B’’ C)]
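The spreading policy in the example can be sketched as a scheduler that pins each code region to one core, so a core's instruction working set is a single region regardless of which thread the work came from. The `spread` function and the (thread, region) work-item encoding are invented for illustration.

```python
def spread(work_items, num_cores):
    """work_items: list of (thread_id, region). Each region is pinned to one core."""
    region_to_core = {}
    schedule = {c: [] for c in range(num_cores)}
    for thread_id, region in work_items:
        # First sighting of a region assigns it a core; later work for that
        # region (from any thread) migrates to the same core.
        core = region_to_core.setdefault(region, len(region_to_core) % num_cores)
        schedule[core].append((thread_id, region))
    return schedule

sched = spread([(1, "A"), (1, "B"), (2, "A"), (2, "B")], 2)
```

Both threads' region-A work lands on core 0 and region-B work on core 1, localizing instruction references per core at the cost of migrating each thread's computation across cores, which is exactly the data-locality tradeoff the next slide reports.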

41 L1 Cache Performance Significant reduction in instruction misses Data cache performance deteriorates  Migration of computation disrupts locality of data references

42 Data Memory Hierarchy Co-operative caching

43 Conventional CMP Organizations [Figure: private-L2 organization, with each core's L1I/L1D and L2 connected by an interconnect network]

44 Conventional CMP Organizations [Figure: shared-L2 organization, with each core's L1I/L1D connected through an interconnect network to a shared L2]

45 A Spectrum of CMP Cache Designs [Figure: spectrum from shared caches (unconstrained capacity sharing; best capacity) to private caches (no capacity sharing; best latency), with configurable/malleable caches (Liu et al. HPCA’04, Huh et al. ICS’05, etc.), hybrid schemes (CMP-NUCA, victim replication, CMP-NuRAPID, etc.), and cooperative caching in between]

46 CMP Cooperative Caching Basic idea: private caches cooperatively form an aggregate global cache  Use private caches for fast access / isolation  Share capacity through cooperation  Mitigate interference via cooperation throttling Inspired by cooperative file/web caches [Figure: cores with private L1s and L2s cooperating across the interconnect to form an aggregate cache]
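A minimal sketch of the cooperative-caching idea, assuming a toy model: each core has a private LRU cache, a miss checks sibling caches before going off-chip, and an evicted block is spilled into a peer with spare room instead of being dropped. The `CoopCache` class is invented for illustration and omits coherence and cooperation throttling.

```python
from collections import OrderedDict

class CoopCache:
    def __init__(self, capacity, peers=()):
        self.data = OrderedDict()     # LRU order: oldest entry first
        self.capacity = capacity
        self.peers = list(peers)      # sibling private caches on chip

    def access(self, addr):
        if addr in self.data:
            self.data.move_to_end(addr)
            return "local hit"
        for p in self.peers:          # cooperation: probe siblings first
            if addr in p.data:
                return "on-chip hit"
        self.insert(addr)
        return "off-chip miss"

    def insert(self, addr):
        if len(self.data) >= self.capacity:
            victim, _ = self.data.popitem(last=False)
            for p in self.peers:      # spill the victim instead of dropping it
                if len(p.data) < p.capacity:
                    p.data[victim] = True
                    break
        self.data[addr] = True

c0, c1 = CoopCache(2), CoopCache(2)
c0.peers, c1.peers = [c1], [c0]
c0.access(1); c0.access(2); c0.access(3)   # evicting 1 spills it into c1
hit = c0.access(1)                         # served on-chip by the sibling
```

The later access to block 1 is an on-chip hit rather than an off-chip miss: the private caches behave like an aggregate global cache while keeping fast local access.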

47 Reduction of Off-chip Accesses Trace-based simulation: 1MB L2 cache per core  Shared (LRU): most reduction  C-Clean: good for commercial workloads  C-Singlet: good for all benchmarks  C-1Fwd: good for heterogeneous workloads [Figure: reduction of off-chip access rate, MPKI relative to private caches]

48 Cycles Spent on L1 Misses Trace-based simulation: no miss overlapping Latencies: 10 cycles local L2; 30 cycles next L2; 40 cycles remote L2; 300 cycles off-chip [Figure: cycles spent on L1 misses for OLTP, Apache, ECperf, JBB, Barnes, Ocean, Mix1, Mix2; P: Private, C: C-Clean, F: C-1Fwd, S: Shared]

49 Summary Start of a new wave of innovation to rethink parallelism and parallel architecture New opportunities for innovation in CMPs New opportunities for parallelizing applications Expect little resemblance between MPs today and CMPs in 15 years We need to invent and define differences