
The Imperative of Disciplined Parallelism: A Hardware Architect's Perspective
Sarita Adve, Vikram Adve, Rob Bocchino, Nicholas Carter, Byn Choi, Ching-Tsun Chou, Stephen Heumann, Nima Honarmand, Rakesh Komuravelli, Maria Kotsifakou, Pablo Montesinos, Tatiana Schpeisman, Matthew Sinclair, Robert Smolinski, Prakalp Srivastava, Hyojin Sung, Adam Welc
University of Illinois at Urbana-Champaign, CMU, Intel, Qualcomm

Silver Bullets for the Energy Crisis?
Parallelism, specialization, heterogeneity, … BUT large impact on:
– Software
– Hardware
– Hardware-software interface

Multicore Parallelism: Current Practice
Multicore parallelism today means shared memory:
– Complex, power- and performance-inefficient hardware: complex directory coherence, unnecessary traffic, ...
– Difficult programming model: data races, non-determinism, composability?, testing?
– Mismatched HW/SW interface, a.k.a. the memory model: can't specify what value a read can return; data races defy acceptable semantics
Fundamentally broken for hardware and software

Specialization/Heterogeneity: Current Practice
A modern smartphone: CPU, GPU, DSP, vector units, multimedia and audio-video accelerators
– 6 different ISAs, 7 different parallelism models, incompatible memory systems
Even more broken

Energy Crisis Demands Rethinking HW, SW
How to (co-)design:
– Software? Deterministic Parallel Java (DPJ)
– Hardware? DeNovo
– HW/SW interface? Virtual Instruction Set Computing (VISC)
Focus here on (homogeneous) parallelism

Multicore Parallelism: Current Practice (revisited)
Fundamentally broken for hardware and software. Banish shared memory? No: banish wild shared memory! We need disciplined shared memory.

What is Shared-Memory?
Wild shared-memory = global address space + implicit, anywhere communication and synchronization.
Disciplined shared-memory = global address space + explicit, structured side-effects.

Our Approach: Disciplined Shared Memory
Explicit effects + structured parallel control give:
– Strong safety properties, via Deterministic Parallel Java (DPJ): no data races, determinism-by-default, safe non-determinism; simple semantics, safety, and composability
– Efficiency in complexity, performance, and power, via DeNovo: simplify coherence and consistency; optimize communication and storage layout
The result: a simple programming model AND complexity-, performance-, and power-scalable hardware.

DeNovo Hardware Project
Rethink the memory hierarchy for disciplined software:
– Started with Deterministic Parallel Java (DPJ)
– End goal is a language-oblivious interface: an LLVM-inspired virtual ISA
Software research strategy:
– Started with deterministic codes
– Added safe non-determinism [ASPLOS'13]
– Ongoing: other parallel patterns, OS, legacy, …
Hardware research strategy:
– Homogeneous on-chip memory system: coherence, consistency, communication, data layout
– Similar ideas apply to heterogeneous systems and off-chip memory

Current Hardware Limitations
Complexity:
– Subtle races and numerous transient states in the protocol
– Hard to verify and extend for optimizations
Storage overhead:
– Directory overhead for sharer lists
Performance and power inefficiencies:
– Invalidation and ack messages
– Indirection through the directory
– False sharing (cache-line-based coherence)
– Bandwidth waste (cache-line-based communication)
– Cache pollution (cache-line-based allocation)

Results for Deterministic Codes
Complexity:
– No transient states; simple to extend for optimizations
– Base DeNovo is 20X faster to verify than MESI
Storage overhead:
– No storage overhead for directory information
Performance and power inefficiencies:
– No invalidation or ack messages; no indirection through the directory
– No false sharing: region-based coherence
– Region, not cache-line, communication
– Region, not cache-line, allocation (ongoing)
Bottom line: up to 79% lower memory stall time and up to 66% lower traffic.

Outline: Introduction, DPJ Overview, Base DeNovo Protocol, DeNovo Optimizations, Evaluation, Conclusion and Future Work

DPJ Overview
Deterministic-by-default parallel language [OOPSLA'09]:
– Extension of sequential Java
– Structured parallel control: nested fork-join
– Novel region-based type and effect system
– Speedups close to hand-written Java
– Expressive enough for irregular, dynamic parallelism
Also supports:
– Disciplined non-determinism [POPL'11]: explicit, data-race-free, isolated; non-deterministic and deterministic code co-exist safely (composable)
– Unanalyzable effects, using trusted frameworks [ECOOP'12]
– Unstructured parallelism, using tasks with effects [PPoPP'13]
Focus here on deterministic codes.

Regions and Effects
Region: a name for a set of memory locations
– Assign a region to each field and array cell
Effect: a read or write on a region
– Summarize the effects of method bodies
Compiler: a simple type check verifies that
– Region types are consistent
– Effect summaries are correct
– Parallel tasks don't interfere
Type-checked programs are guaranteed determinism-by-default.

Example: A Pair Class
Declaring and using region names (region names have static scope, one per class), writing method effect summaries, and expressing parallelism; the effects inside the cobegin are inferred by the compiler:

    class Pair {
        region Blue, Red;
        int X in Blue;
        int Y in Red;
        void setX(int x) writes Blue { this.X = x; }
        void setY(int y) writes Red { this.Y = y; }
        void setXY(int x, int y) writes Blue; writes Red {
            cobegin {
                setX(x); // inferred effect: writes Blue
                setY(y); // inferred effect: writes Red
            }
        }
    }
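
A hypothetical usage sketch, not on the original slide: the slide's figure shows a Pair object with X = 3 in region Blue and Y = 42 in region Red, which a call like the one below would produce. Because the effect summaries prove setX and setY write disjoint regions, the cobegin can run them in parallel with no data race:

    Pair p = new Pair();
    p.setXY(3, 42);   // X = 3 (Blue), Y = 42 (Red), in parallel, race-free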

Outline: Introduction, Background (DPJ), Base DeNovo Protocol, DeNovo Optimizations, Evaluation, Conclusion and Future Work

Memory Consistency Model
Guaranteed determinism means a read returns the value of the last write in sequential order, which comes either:
1. From the same task in this parallel phase, or
2. From before this parallel phase
[Figure: a load of 0xa in a parallel phase must see the store to 0xa ordered before it, not some other concurrent store; the coherence mechanism enforces this]
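
A DPJ-flavored sketch of the rule, not from the slides; region annotations are elided and f is a hypothetical pure function:

    foreach (int i in 0, n) {   // parallel phase 1
        data[i] = f(i);         // each task writes only its own element
    }
    foreach (int i in 0, n) {   // parallel phase 2
        int v = data[i];        // guaranteed to return f(i): the last write
    }                           // in sequential order, from before this phase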

Cache Coherence
Coherence enforcement requires:
1. Invalidating stale copies in caches
2. Tracking one up-to-date copy
Explicit effects:
– The compiler knows all regions written in this parallel phase
– The cache can therefore self-invalidate before the next parallel phase: it invalidates data in writeable regions not accessed by itself
Registration:
– The directory keeps track of one up-to-date copy
– A writer updates the directory before the next parallel phase
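
A minimal Java-style sketch of self-invalidation at a phase boundary, assuming hypothetical simulator types Cache, CacheWord, Region, and State; an illustration of the mechanism, not the DeNovo implementation:

    // At a phase boundary, drop cached words in regions written this phase,
    // unless this core itself touched (read) them during the phase.
    void selfInvalidate(Cache l1, Set<Region> regionsWrittenThisPhase) {
        for (CacheWord w : l1.words()) {
            if (regionsWrittenThisPhase.contains(w.region()) && !w.touched()) {
                w.setState(State.INVALID);  // stale copy discarded locally:
            }                               // no invalidation messages needed
            w.clearTouched();               // reset for the next phase
        }
    }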

Basic DeNovo Coherence [PACT'11]
Assume (for now): private L1, shared L2, single-word lines
– Data-race freedom at word granularity
L2 data arrays double as the directory (the "registry"):
– Each entry keeps either valid data or a registered core id, so there is no space overhead
L1/L2 states: Invalid, Valid, Registered (a read brings a word to Valid; a write takes it to Registered)
– The touched bit is set only if the word is read in the phase
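
A sketch of the three stable states as plain Java, following the read/write transitions named above; illustrative only, with hypothetical names:

    enum State { INVALID, VALID, REGISTERED }   // no transient states

    class DeNovoWord {
        State s = State.INVALID;
        void onRead()  { if (s == State.INVALID) s = State.VALID; } // a miss fetches the word
        void onWrite() { s = State.REGISTERED; } // a write registers the word at the L2 registry
    }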

Example Run

    class S_type {
        X in DeNovo-region …;
        Y in DeNovo-region …;
    }
    S_type S[size];
    ...
    Phase1 writes … { // DeNovo effect
        foreach i in 0, size {
            S[i].X = …;
        }
        self_invalidate(…);
    }

[Figure: per-word states (R = Registered, V = Valid, I = Invalid) in Core 1's L1, Core 2's L1, and the shared L2. Core 1 writes and registers X0–X2 while Core 2 writes and registers X3–X5; the L2 records the registered core's id in place of the data and sends a registration ack. At the end of the phase each core self-invalidates the X words registered by the other core, while the Y words stay Valid]

Practical DeNovo Coherence
The basic protocol is impractical: a tag per word means high tag storage overhead.
– Address/transfer granularity > coherence granularity
DeNovo line-based protocol:
– Traditional software-oblivious spatial locality
– Coherence granularity stays at the word level: no word-level false sharing
– "Line merging": one tag per line, state kept per word
[Figure: a cache line with a single Tag and per-word states, e.g. V, V, R]
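
A sketch of line merging as a data structure, with hypothetical names (State as in the earlier sketch): one tag serves the whole line while state stays per-word, so the words of one line can be in different states (e.g. V, V, R) without word-level false sharing:

    class CacheLine {
        long tag;            // one address tag per line: low storage overhead
        State[] wordState;   // Invalid / Valid / Registered, kept per word
        int[] dataWords;     // the line's data, one word per state entry
    }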

Current Hardware Limitations (recap)
Complexity:
– Subtle races and numerous transient states in the protocol
– Hard to extend for optimizations
Storage overhead:
– Directory overhead for sharer lists
Performance and power inefficiencies:
– Invalidation and ack messages
– Indirection through the directory
– False sharing (cache-line-based coherence)
– Traffic (cache-line-based communication)
– Cache pollution (cache-line-based allocation)
✔ ✔ ✔ ✔ : four of these limitations are already addressed by the protocol described so far.

Flexible, Direct Communication
Insights:
1. A traditional directory must be updated at every transfer; DeNovo can copy valid data around freely
2. Traditional systems send a cache line at a time; DeNovo uses regions to transfer only the relevant data
Net effect: an AoS-to-SoA transformation without programmer or compiler involvement.
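
A plain-Java illustration, with hypothetical names, of why region-granularity transfer acts like an AoS-to-SoA transformation:

    // Array-of-structs layout: x, y, z interleaved per element.
    class Element { int x, y, z; }

    // A phase that reads only the x fields still drags y and z across the
    // network under cache-line-granularity transfer. Region-granularity
    // transfer ships just the x region: the traffic of a struct-of-arrays
    // layout (int[] xs) with no programmer or compiler change.
    static int sumX(Element[] elems) {
        int sum = 0;
        for (Element e : elems) sum += e.x;
        return sum;
    }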

Flexible, Direct Communication (example)
[Figures: Core 1's load of X3 misses; the shared L2 records Core 2 as the registered owner, so the request is forwarded directly to Core 2's L1 with no directory update. With line-granularity transfer the reply carries X3 along with the unneeded Y3 and Z3; with region-granularity transfer it instead carries X3, X4, and X5, which the phase will actually use, leaving them Valid in Core 1's L1]

Current Hardware Limitations (recap)
With flexible, direct communication added, every limitation above is addressed (✔ ✔ ✔ ✔ ✔ ✔ ✔): no transient states, extensible, no directory storage for sharer lists, no invalidation/ack messages, no indirection, no false sharing, and region-granularity traffic. Cache pollution (cache-line-based allocation) remains ongoing work.

Outline: Introduction, Background (DPJ), Base DeNovo Protocol, DeNovo Optimizations, Evaluation (Complexity, Performance), Conclusion and Future Work

Protocol Verification
Compared the DeNovo and MESI protocols at word granularity with Murphi model checking.
Correctness:
– Six bugs found in the MESI protocol, difficult to find and fix
– Three bugs found in the DeNovo protocol, simple to fix
Complexity:
– 15X fewer reachable states for DeNovo
– 20X difference in verification runtime

Performance Evaluation Methodology
Simulator: Simics + GEMS + Garnet
System parameters: 64 cores, simple in-order core model
Workloads:
– FFT, LU, Barnes-Hut, and radix from SPLASH-2
– bodytrack and fluidanimate from PARSEC 2.1
– kd-Tree, two versions [HPG'09]

MESI Word (MW) vs. DeNovo Word (DW)
DW's performance is competitive with MW.
[Chart: memory stall time for FFT, LU, kdFalse, kdPadded, Barnes, bodytrack, fluidanimate, and radix]

MESI Line (ML) vs. DeNovo Line (DL)
DL's memory stall time is about the same as or better than ML's, and DL outperforms ML significantly on apps with false sharing.
[Chart: same benchmarks]

Optimizations on DeNovo Line
The combined optimizations perform best, except on LU and bodytrack: apps with low spatial locality suffer from line-granularity allocation.
[Chart: same benchmarks]

Network Traffic
DeNovo has less traffic than MESI in most cases, but incurs more write traffic due to word-granularity registration; this can be mitigated with a "write-combining" optimization.
[Chart: same benchmarks]
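
A hypothetical sketch of write combining in plain Java, not the DeNovo implementation: buffer word-granularity registrations to the same line and send one registration message per line at the end of the phase (the println stands in for the network send):

    import java.util.BitSet;
    import java.util.HashMap;
    import java.util.Map;

    class WriteCombiner {
        static final int WORDS_PER_LINE = 16;
        private final Map<Long, BitSet> pending = new HashMap<>();

        void register(long wordAddr) {  // called on each word registration
            pending.computeIfAbsent(wordAddr / WORDS_PER_LINE, k -> new BitSet())
                   .set((int) (wordAddr % WORDS_PER_LINE));
        }

        void flushAtPhaseEnd() {        // one message per line, not per word
            pending.forEach((line, words) ->
                System.out.println("register line " + line + " words " + words));
            pending.clear();
        }
    }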

Network Traffic (with write combining)
With write combining, DeNovo's traffic is less than or comparable to MESI's: the optimization is effective.
[Chart: same benchmarks]

L1 Fetch Bandwidth Waste
Most data brought into the L1 is wasted:
– 40–89% of the data MESI fetches is unnecessary
– DeNovo+Flex reduces traffic by up to 66%
– Can we also reduce allocation waste? This motivates region-driven data layout.

Current Hardware Limitations (recap)
All of the limitations are now addressed (✔), with cache-line-based allocation ongoing. The unifying theme: a region-driven memory hierarchy.

Conclusions and Future Work (1 of 3)
Disciplined shared memory, via explicit effects + structured parallel control, gives:
– Strong safety properties, via Deterministic Parallel Java (DPJ): no data races, determinism-by-default, safe non-determinism; simple semantics, safety, and composability
– Efficiency in complexity, performance, and power, via DeNovo: simplified coherence and consistency; optimized communication and storage layout
The result: a simple programming model AND complexity-, performance-, and power-scalable hardware.

Conclusions and Future Work (2 of 3)
DeNovo rethinks hardware for disciplined models. For deterministic codes:
Complexity:
– No transient states: 20X faster to verify than MESI
– Extensible: optimizations added without new states
Storage overhead:
– No directory overhead
Performance and power inefficiencies:
– No invalidations, acks, false sharing, or indirection
– Flexible, not cache-line, communication
– Up to 79% lower memory stall time, up to 66% lower traffic
The ASPLOS'13 paper adds safe non-determinism.

Conclusions and Future Work (3 of 3)
Broaden the software supported:
– Pipeline parallelism, OS, legacy, …
Region-driven memory hierarchy:
– Also applies to heterogeneous memory: a global address space with region-driven coherence, communication, and layout
Hardware/software interface:
– Language-neutral virtual ISA
Parallelism and specialization may solve the energy crisis, but they require rethinking software, hardware, and the interface: the disciplined parallel programming imperative.