DeNovo: Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, Nima Honarmand, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, Ching-Tsun Chou University of Illinois at Urbana-Champaign and Intel

Motivation
Goal: safe and efficient parallel computing
– Easy, safe programming model
– Complexity-, power-, performance-scalable hardware
Today: shared memory
– Complex, power- and performance-inefficient hardware
  * Complex directory coherence, unnecessary traffic, ...
– Difficult programming model
  * Data races, non-determinism, composability/modularity?, testing?
– Mismatched interface between HW and SW, a.k.a. the memory model
  * Cannot specify "what value a read can return"
  * Data races defy acceptable semantics
⇒ Fundamentally broken for both hardware and software

Motivation (same list as above) ⇒ Banish shared memory?

Motivation (same list as above) ⇒ Banish wild shared memory! Need disciplined shared memory!

What is Shared-Memory? Shared-Memory = Global address space + Implicit, anywhere communication, synchronization

What is Shared-Memory? Wild Shared-Memory = Global address space + Implicit, anywhere communication, synchronization

What is Shared-Memory? Disciplined Shared-Memory = Global address space + Explicit, structured communication, synchronization, and side-effects (in place of implicit, anywhere communication)

Benefits of Explicit Effects
Explicit effects + structured parallel control enable disciplined shared memory on both sides of the HW/SW interface:
– Software (Deterministic Parallel Java, DPJ): strong safety properties; determinism-by-default with explicit & safe non-determinism; simple semantics, composability, testability
– Hardware (DeNovo): efficiency in complexity, performance, and power; simplified coherence and consistency; optimized communication and storage layout
Result: a simple programming model AND complexity-, power-, performance-scalable hardware

DeNovo Hardware Project
Exploit discipline for efficient hardware
– Current driver is DPJ (Deterministic Parallel Java)
– End goal is a language-oblivious interface
Software research strategy
– Deterministic codes
– Safe non-determinism
– OS, legacy, ...
Hardware research strategy
– On-chip: coherence, consistency, communication, data layout
– Off-chip: similar ideas apply, next step

Current Hardware Limitations
Complexity
– Subtle races and numerous transient states in the protocol
– Hard to extend for optimizations
Storage overhead
– Directory overhead for sharer lists
Performance and power inefficiencies
– Invalidation and ack messages
– False sharing
– Indirection through the directory
– Fixed cache-line communication granularity

Current Hardware Limitations vs. DeNovo Contributions
Complexity
– Subtle races and numerous transient states in the protocol → DeNovo: no transient states
– Hard to extend for optimizations → DeNovo: simple to extend for optimizations
Storage overhead
– Directory overhead for sharer lists → DeNovo: no storage overhead for directory information
Performance and power inefficiencies
– Invalidation and ack messages → DeNovo: no invalidation and ack messages
– False sharing → DeNovo: no false sharing
– Indirection through the directory → DeNovo: no indirection through the directory
– Traffic (fixed cache-line communication) → DeNovo: flexible, not cache-line, communication
Up to 81% reduction in memory stall time

Outline: Motivation / Background: DPJ / Base DeNovo Protocol / DeNovo Optimizations / Evaluation (Complexity, Performance) / Conclusion and Future Work

Background: DPJ [OOPSLA 09]
Extension to Java; fully Java-compatible
Structured parallel control: nested fork-join style – foreach, cobegin
A novel region-based type and effect system
Speedups close to hand-written Java programs
Expressive enough for irregular, dynamic parallelism

Regions and Effects
Region: a name for a set of memory locations
– Programmer assigns a region to each field and array cell
– Regions partition the heap
Effect: a read or write on a region
– Programmer summarizes the effects of method bodies
Compiler checks that
– Region types are consistent and effect summaries are correct
– Parallel tasks are non-interfering (no conflicts)
– Simple, modular type checking (no inter-procedural analysis needed)
Programs that type-check are guaranteed to be deterministic

Example: A Pair Class
Declaring and using region names (region names have static scope, one per class), writing method effect summaries, and expressing parallelism:

    class Pair {
        region Blue, Red;
        int X in Blue;
        int Y in Red;
        void setX(int x) writes Blue { this.X = x; }
        void setY(int y) writes Red { this.Y = y; }
        void setXY(int x, int y) writes Blue; writes Red {
            cobegin {
                setX(x);   // inferred effect: writes Blue
                setY(y);   // inferred effect: writes Red
            }
        }
    }

[Object diagram: a Pair instance with X = 3 in region Pair.Blue and Y = 42 in region Pair.Red]

Outline: Motivation / Background: DPJ / Base DeNovo Protocol / DeNovo Optimizations / Evaluation (Complexity, Performance) / Conclusion and Future Work

Memory Consistency Model
Guaranteed determinism ⇒ a read returns the value of the last write in sequential order:
1. From the same task in this parallel phase, or
2. From before this parallel phase
[Diagram: a store (ST 0xa) before the parallel phase and a load (LD 0xa) plus another store (ST 0xa) inside the phase; the coherence mechanism must make the load return the value written before the phase]
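To make the read-return rule concrete, here is a minimal DPJ-style sketch (illustrative only; the array, region, and variable names are assumptions, not code from the talk). Because tasks in a phase are non-interfering, a read can only see its own task's earlier writes in this phase or writes from before the phase:

    // Before the parallel phase: A[i] = 0 for all i
    foreach (int i in 0, N) {        // parallel phase; each i is one task
        int v = A[i];                // rule 2: returns 0, the value written before the phase
        A[i] = v + 1;                // this task's own write to its own element
        int w = A[i];                // rule 1: returns 1, this task's own write in this phase
        // Reading A[j] for j != i is never allowed here: that would be a data
        // race with task j's write, and the DPJ effect system rejects it.
    }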

Cache Coherence
Coherence enforcement
1. Invalidate stale copies in caches
2. Track the up-to-date copy
Explicit effects
– Compiler knows all regions written in this parallel phase
– Cache can self-invalidate before the next parallel phase
  * Invalidates data in writeable regions not accessed by itself
Registration
– Directory keeps track of one up-to-date copy
– Writer updates the directory before the next parallel phase
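The two mechanisms can be summarized in a small software model. The following is a minimal sketch in plain Java with illustrative names (Word, endPhase, and so on); it models the logic of self-invalidation and registration, not the hardware interface from the talk:

    import java.util.*;

    class DeNovoCoherenceSketch {
        enum State { INVALID, VALID, REGISTERED }

        static class Word {                      // one word cached in a private L1
            final String region; State state = State.VALID; int value;
            Word(String region, int value) { this.region = region; this.value = value; }
        }

        static final Map<Integer, Integer> directory = new HashMap<>(); // addr -> registered core (held in L2)

        final int coreId;
        final Map<Integer, Word> l1 = new HashMap<>();                  // addr -> cached word
        final Set<Integer> touchedThisPhase = new HashSet<>();

        DeNovoCoherenceSketch(int coreId) { this.coreId = coreId; }

        // Registration: a write updates the word locally and records this core
        // as the one up-to-date copy in the L2 directory, before the next phase.
        void write(int addr, String region, int value) {
            Word w = l1.computeIfAbsent(addr, a -> new Word(region, 0));
            w.value = value;
            w.state = State.REGISTERED;
            directory.put(addr, coreId);
            touchedThisPhase.add(addr);
        }

        // Self-invalidation: at the end of a parallel phase, invalidate the words
        // in regions that were writeable this phase but that this core did not touch.
        void endPhase(Set<String> regionsWrittenThisPhase) {
            for (Map.Entry<Integer, Word> e : l1.entrySet()) {
                Word w = e.getValue();
                if (regionsWrittenThisPhase.contains(w.region)
                        && !touchedThisPhase.contains(e.getKey())
                        && w.state != State.REGISTERED) {
                    w.state = State.INVALID;
                }
            }
            touchedThisPhase.clear();
        }
    }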

Basic DeNovo Coherence
Assume (for now): private L1, shared L2; single-word lines
– Data-race freedom at word granularity
No space overhead
– Keep valid data or registered core id
– L2 data arrays double as directory
No transient states
[State diagram: three stable states per word, Invalid, Valid, Registered; a read takes Invalid to Valid, a write takes any state to Registered (with registration at the L2 registry)]
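A minimal sketch of the "L2 data arrays double as directory" point, in the same illustrative Java-model style as above (field and method names are assumptions): each L2 word entry stores either the up-to-date data or, once some core has registered the word, that core's id in the very same storage, so no separate directory is needed and no transient states arise.

    // One L2 entry per word: no extra directory storage.
    static final class L2WordEntry {
        boolean registered;   // false: payload holds data; true: payload holds a core id
        int payload;

        void registerOwner(int coreId) {   // a core's write registers the word
            registered = true;
            payload = coreId;              // the data array is reused to name the owner
        }
        void ownerWritesBack(int data) {   // e.g. when the registered word is evicted from L1
            registered = false;
            payload = data;
        }
        int respondToRead() {
            // If registered, the read request is forwarded to core 'payload';
            // otherwise the L2 itself supplies the data in 'payload'.
            return payload;
        }
    }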

Example Run

    class Pair {
        X in DeNovo-region Blue;
        Y in DeNovo-region Red;
        void setX(int x) writes Blue { this.X = x; }
    }
    Pair PArray[size];
    ...
    Phase1 writes Blue {          // DeNovo effect
        foreach i in 0, size {
            PArray[i].setX(...);
        }
        self_invalidate(Blue);
    }

[Animation: Cores 1 and 2 each run tasks that write the X fields of their half of PArray. Each written X word becomes Registered (R) in the writing core's L1, and the shared L2 records that core as its registrant via registration acks; the Y words stay Valid (V). At the end of the phase, each core self-invalidates (I) the X words of region Blue that it did not write itself.]

Practical DeNovo Coherence
Basic protocol is impractical
– High tag storage overhead (a tag per word)
Address/transfer granularity > coherence granularity
DeNovo line-based protocol ("line merging")
– Traditional software-oblivious spatial locality
– Coherence granularity still at word granularity, so no word-level false sharing
[Figure: a cache line with one tag but per-word coherence states, e.g. V V R | tag | V V V]
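A sketch of what "line merging" means for the cache structure, again as an illustrative Java model (the field names and line size are assumptions): one tag and one data block per line, but coherence state kept per word, so another core writing a neighboring word in the same line never forces this word out.

    static final class DeNovoCacheLine {
        static final int WORDS_PER_LINE = 8;       // illustrative line size
        enum WordState { INVALID, VALID, REGISTERED }

        long tag;                                   // one tag for the whole line
        final int[] data = new int[WORDS_PER_LINE];
        final WordState[] state = new WordState[WORDS_PER_LINE];  // per word, not per line

        // A word is usable if its own state allows it, independent of the states
        // of the other words sharing the line: no word-level false sharing.
        boolean readable(int wordIndex) {
            return state[wordIndex] != WordState.INVALID;
        }
    }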

Current Hardware Limitations (revisited): the checkmarks on the slide show that the base DeNovo protocol already removes several of the limitations listed earlier; the remaining ones, notably indirection through the directory and the fixed cache-line communication granularity, are addressed by the optimizations that follow.

Protocol Optimizations: Insights
1. A traditional directory must be updated at every transfer ⇒ DeNovo can copy valid data around freely, with no directory update
2. Traditional systems send a full cache line at a time ⇒ DeNovo uses regions to transfer only the relevant data

Protocol Optimizations (example)
Direct cache-to-cache transfer; flexible communication
[Animation: Core 1 issues LD X3 and misses; the shared L2 shows Core 2 as registered for X3–X5, so the request is forwarded to Core 2, which replies directly cache-to-cache with X3 and also X4 and X5 (the other valid words of the same region), but not the irrelevant Y and Z words; Core 1's copies become Valid without any directory update.]
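The responder's side of this example can be expressed as a small sketch (plain Java; all names are illustrative, not the protocol's message format): because any Valid copy may be replicated freely without a directory update, the responder can hand back every valid word of the requested region rather than a fixed cache line.

    import java.util.*;

    class FlexibleTransferSketch {
        record CachedWord(int addr, String region, boolean valid, int value) {}

        // On a forwarded read miss, return only the valid words of the requested
        // region from the responder's cache (e.g. X3, X4, X5 but not Y or Z words).
        static List<CachedWord> buildResponse(String requestedRegion,
                                              List<CachedWord> responderCache) {
            List<CachedWord> payload = new ArrayList<>();
            for (CachedWord w : responderCache) {
                if (w.valid() && w.region().equals(requestedRegion)) {
                    payload.add(w);   // valid data may be copied cache-to-cache freely,
                }                     // with no directory update needed
            }
            return payload;
        }
    }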

Current Hardware Limitations (revisited): with the direct cache-to-cache transfer and flexible-communication optimizations, every limitation listed earlier is now addressed (all items checked on the slide).

Outline: Motivation / Background: DPJ / Base DeNovo Protocol / DeNovo Optimizations / Evaluation (Complexity, Performance) / Conclusion and Future Work

Protocol Verification
DeNovo vs. MESI (word granularity) with the Murphi model checker
Correctness
– Six bugs found in the MESI protocol, difficult to find and fix
– Three bugs found in the DeNovo protocol, simple to fix
Complexity
– 15x fewer reachable states for DeNovo
– 20x difference in verification runtime

Performance Evaluation Methodology
Simulator: Simics + GEMS + Garnet
System parameters
– 64 cores
– Simple in-order core model
Workloads
– FFT, LU, Barnes-Hut, and radix from SPLASH-2
– bodytrack and fluidanimate from PARSEC 2.1
– kd-Tree (two versions) [HPG 09]

MESI Word (MW) vs. DeNovo Word (DW)
[Chart: memory stall time for FFT, LU, kdFalse, kdPadded, Barnes, bodytrack, fluidanimate, radix]
DW's performance is competitive with MW

MESI Line (ML) vs. DeNovo Line (DL)
[Chart: memory stall time for FFT, LU, kdFalse, kdPadded, Barnes, bodytrack, fluidanimate, radix]
DL has about the same or better memory stall time than ML
DL outperforms ML significantly on apps with false sharing

Optimizations on DeNovo Line
[Chart: memory stall time for FFT, LU, kdFalse, kdPadded, Barnes, bodytrack, fluidanimate, radix]
Combined optimizations perform best
– Except for LU and bodytrack
– Apps with low spatial locality suffer from line-granularity allocation

Network Traffic
[Chart: network traffic for FFT, LU, kdFalse, kdPadded, Barnes, bodytrack, fluidanimate, radix]
DeNovo has less traffic than MESI in most cases
DeNovo incurs more write traffic
– Due to word-granularity registration
– Can be mitigated with a "write-combining" optimization
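The write-combining mitigation is only named on the slide; a rough sketch of the idea, under the assumption that registrations are buffered and coalesced (plain Java, illustrative names), is to send one registration message per cache line covering all words written in it, instead of one message per word:

    import java.util.*;

    class WriteCombiningSketch {
        static final int WORDS_PER_LINE = 8;                       // illustrative
        private final Map<Long, BitSet> pending = new HashMap<>(); // line addr -> words written

        void recordWrite(long wordAddr) {
            pending.computeIfAbsent(wordAddr / WORDS_PER_LINE, a -> new BitSet(WORDS_PER_LINE))
                   .set((int) (wordAddr % WORDS_PER_LINE));
        }

        // Flush before the next parallel phase: one registration message per line,
        // covering all of its written words, instead of one message per word.
        int flushRegistrations() {
            int messages = pending.size();
            pending.clear();
            return messages;
        }
    }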

Conclusion and Future Work
DeNovo rethinks hardware for disciplined models
Complexity
– No transient states: 20x faster to verify than MESI
– Extensible: optimizations without new states
Storage overhead
– No directory overhead
Performance and power inefficiencies
– No invalidations, acks, false sharing, or indirection
– Flexible, not cache-line, communication
– Up to 81% reduction in memory stall time
Future work: region-driven layout, off-chip memory, safe non-determinism, sync, OS, legacy

Thank You!