1
DeNovo: Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, Nima Honarmand, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, Ching-Tsun Chou University of Illinois at Urbana-Champaign and Intel denovo@cs.illinois.edu
2
Motivation
Goal: Safe and efficient parallel computing
– Easy, safe programming model
– Complexity-, power-, performance-scalable hardware
Today: shared memory
– Complex, power- and performance-inefficient hardware
  * Complex directory coherence, unnecessary traffic, ...
– Difficult programming model
  * Data races, non-determinism, composability/modularity?, testing?
– Mismatched interface between HW and SW, a.k.a. the memory model
  * Can't specify "what value a read can return"
  * Data races defy acceptable semantics
Fundamentally broken for hardware & software
3
Motivation
Banish shared memory?
4
Motivation
Banish wild shared memory!
Need disciplined shared memory!
5
What is Shared-Memory? Shared-Memory = Global address space + Implicit, anywhere communication, synchronization
7
What is Shared-Memory? Wild Shared-Memory = Global address space + Implicit, anywhere communication, synchronization
9
What is Shared-Memory?
Disciplined Shared-Memory = Global address space + explicit, structured side-effects (in place of implicit, anywhere communication and synchronization)
10
Benefits of Explicit Effects
Disciplined shared memory = explicit effects + structured parallel control
Strong safety properties (Deterministic Parallel Java, DPJ)
– Determinism-by-default, explicit & safe non-determinism
– Simple semantics, composability, testability
Efficiency: complexity, performance, power (DeNovo)
– Simplify coherence and consistency
– Optimize communication and storage layout
Together: a simple programming model AND complexity-, power-, performance-scalable hardware
11
DeNovo Hardware Project
Exploit discipline for efficient hardware
– Current driver is DPJ (Deterministic Parallel Java)
– End goal is a language-oblivious interface
Software research strategy
– Deterministic codes
– Safe non-determinism
– OS, legacy, …
Hardware research strategy
– On-chip: coherence, consistency, communication, data layout
– Off-chip: similar ideas apply, next step
12
Current Hardware Limitations
Complexity
– Subtle races and numerous transient states in the protocol
– Hard to extend for optimizations
Storage overhead
– Directory overhead for sharer lists
Performance and power inefficiencies
– Invalidation and ack messages
– False sharing
– Indirection through the directory
– Fixed cache-line communication granularity
13
Current Hardware Limitations vs. DeNovo Contributions
Complexity
– Subtle races and numerous transient states in the protocol → no transient states
– Hard to extend for optimizations → simple to extend for optimizations
Storage overhead
– Directory overhead for sharer lists → no storage overhead for directory information
Performance and power inefficiencies
– Invalidation and ack messages → none needed
– False sharing → none
– Indirection through the directory → none
– Traffic (fixed cache-line communication) → flexible, not cache-line, communication
Result: up to 81% reduction in memory stall time
14
Outline
Motivation
Background: DPJ
Base DeNovo Protocol
DeNovo Optimizations
Evaluation
– Complexity
– Performance
Conclusion and Future Work
15
Background: DPJ [OOPSLA 09]
Extension to Java; fully Java-compatible
Structured parallel control: nested fork-join style
– foreach, cobegin
A novel region-based type and effect system
– Speedups close to hand-written Java programs
– Expressive enough for irregular, dynamic parallelism
16
Regions and Effects
Region: a name for a set of memory locations
– Programmer assigns a region to each field and array cell
– Regions partition the heap
Effect: a read or write on a region
– Programmer summarizes the effects of each method body
Compiler checks that
– Region types are consistent and effect summaries are correct
– Parallel tasks are non-interfering (no conflicts)
– Simple, modular type checking (no inter-procedural ….)
Programs that type-check are guaranteed to be deterministic
17
Example: A Pair Class
Declaring and using region names; region names have static scope (one per class)

    class Pair {
        region Blue, Red;
        int X in Blue;
        int Y in Red;
        void setX(int x) writes Blue { this.X = x; }
        void setY(int y) writes Red { this.Y = y; }
        void setXY(int x, int y) writes Blue; writes Red {
            cobegin {
                setX(x); // writes Blue
                setY(y); // writes Red
            }
        }
    }

[Diagram: a Pair object with X = 3 in region Pair.Blue and Y = 42 in region Pair.Red]
18
Example: A Pair Class
Writing method effect summaries: the writes Blue / writes Red clauses on setX, setY, and setXY in the code above
19
Example: A Pair Class
Expressing parallelism: the cobegin block in setXY runs setX and setY in parallel; their effects (writes Blue, writes Red) are inferred and checked to be non-interfering
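To see what the effect checks buy, consider a hypothetical variant (not from the slides) in which both cobegin tasks write the same region; DPJ's compiler rejects it as interfering:

    // Hypothetical DPJ snippet: does NOT type-check.
    void setXTwice(int a, int b) writes Blue {
        cobegin {
            setX(a); // writes Blue
            setX(b); // writes Blue: conflicts with the first task
        }
    }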
20
Outline
Motivation
Background: DPJ
Base DeNovo Protocol
DeNovo Optimizations
Evaluation
– Complexity
– Performance
Conclusion and Future Work
21
Memory Consistency Model
Guaranteed determinism: a read returns the value of the last write in sequential order, coming from either
1. the same task in this parallel phase, or
2. before this parallel phase
[Diagram: an LD 0xa in a parallel phase observes the ST 0xa from its own task, or else the ST 0xa that precedes the phase]
22
Memory Consistency Model
Same rule as above; the coherence mechanism is what must deliver the correct ST 0xa value to the LD 0xa
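As a concrete illustration (a DPJ-style sketch, not from the slides; x and the phase structure are assumptions), the rule plays out as follows:

    // x is in region R and holds 0 before the phase
    foreach i in 0, 2 {        // parallel phase with tasks 0 and 1
        if (i == 0) {
            x = 1;             // ST from task 0
            int a = x;         // LD returns 1: last write in the same task
        }
        // task 1 must not access x: that would conflict with task 0's
        // write, and the effect checks reject the program
    }
    int b = x;                 // LD after the phase returns 1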
23
Cache Coherence
Coherence enforcement
1. Invalidate stale copies in caches
2. Track the up-to-date copy
Explicit effects
– Compiler knows all regions written in this parallel phase
– Cache can self-invalidate before the next parallel phase
  * Invalidates data in writeable regions it did not access itself
Registration
– Directory keeps track of one up-to-date copy
– Writer updates it before the next parallel phase
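A minimal sketch of the self-invalidation step, assuming hypothetical types and names (Region, State, CacheWord) since the slides give no code:

    import java.util.Set;

    interface Region {}
    enum State { INVALID, VALID, REGISTERED }

    interface CacheWord {
        Region region();
        boolean touchedThisPhase();
        State state();
        void setState(State s);
        void clearTouched();
    }

    // At a phase boundary, invalidate words in regions written this
    // phase, unless this core touched them (so it knows they are up
    // to date) or holds them Registered (it owns the up-to-date copy).
    void selfInvalidate(Iterable<CacheWord> l1Words, Set<Region> writtenRegions) {
        for (CacheWord w : l1Words) {
            if (writtenRegions.contains(w.region())
                    && !w.touchedThisPhase()
                    && w.state() != State.REGISTERED) {
                w.setState(State.INVALID);
            }
            w.clearTouched(); // reset the touched bit for the next phase
        }
    }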
24
Basic DeNovo Coherence
Assume (for now): private L1, shared L2; single-word line
– Data-race freedom at word granularity
No space overhead
– Keep valid data or a registered core id
– L2 data arrays double as the directory
No transient states
Three states per word: Invalid, Valid, Registered; a read takes Invalid to Valid, and a write takes any state to Registered
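A hedged sketch of that state machine (the enum and method names are assumptions, not from the slides):

    enum State { INVALID, VALID, REGISTERED }

    // A read miss fetches the data; no other cache's state changes.
    State onRead(State s) {
        return (s == State.INVALID) ? State.VALID : s;
    }

    // A write registers this core's id at the L2 (which doubles as the
    // directory); no invalidation or ack messages are ever sent.
    State onWrite(State s) {
        return State.REGISTERED;
    }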
25
Example Run

    class Pair {
        X in DeNovo-region Blue;
        Y in DeNovo-region Red;
        void setX(int x) writes Blue { this.X = x; }
    }
    Pair PArray[size];
    ...
    Phase1 writes Blue { // DeNovo effect
        foreach i in 0, size {
            PArray[i].setX(…);
        }
        self_invalidate(Blue);
    }

[Diagram: per-word states (Registered / Valid / Invalid) in Core 1's L1, Core 2's L1, and the shared L2. Core 1 writes X0–X2 and Core 2 writes X3–X5; each registers those words, and the L2 records the registered core id (C1 or C2) in place of the data, acknowledging each registration. At the end of the phase each core self-invalidates the words of region Blue it did not write, while the Y words in region Red stay Valid.]
26
Practical DeNovo Coherence
Basic protocol impractical
– High tag storage overhead (a tag per word)
Address/transfer granularity > coherence granularity
DeNovo line-based protocol
– Traditional software-oblivious spatial locality
– Coherence granularity still at the word level
  * No word-level false sharing
"Line merging" cache: one tag per line, per-word state (e.g., a line whose words are V, V, R)
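A hedged sketch of the line-merging organization (field names are assumptions; State is the three-value enum from the earlier sketch): the address tag is per line, but coherence state stays per word.

    class CacheLine {
        long tag;           // one address tag per line (address/transfer granularity)
        State[] wordState;  // per-word state (coherence granularity), so words
                            // written by different cores coexist in one line
                            // without word-level false sharing
        int[] data;         // the words themselves
    }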
27
Current Hardware Limitations (revisited)
The base DeNovo protocol addresses:
✔ Complexity: no transient states, simple to extend for optimizations
✔ Storage overhead: no directory overhead for sharer lists
✔ Invalidation and ack messages: eliminated
✔ False sharing: eliminated (word-granularity coherence)
Still open: indirection through the directory; fixed cache-line communication granularity
28
Protocol Optimizations
Insights
1. A traditional directory must be updated at every transfer
   ⇒ DeNovo can copy valid data around freely
2. Traditional systems send a whole cache line at a time
   ⇒ DeNovo uses regions to transfer only the relevant data
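A hedged sketch of how the two insights might combine in a miss response (all names are assumptions; the slides give no code): because valid data carries no directory obligations, a responder can attach any valid words of the requested region rather than a fixed line.

    import java.util.*;

    // Reply to a demand miss on word 'missOffset' with that word plus
    // any other valid words of the same region in the line: a direct
    // cache-to-cache transfer with flexible, not cache-line, granularity.
    List<Integer> makeResponse(int[] data, State[] wordState,
                               int[] wordRegion, int missOffset, int region) {
        List<Integer> resp = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            boolean valid = wordState[i] != State.INVALID;
            if (i == missOffset || (valid && wordRegion[i] == region)) {
                resp.add(data[i]);
            }
        }
        return resp;
    }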
29
Protocol Optimizations
Direct cache-to-cache transfer; flexible communication
[Diagram: Core 1 issues LD X3, which is Invalid in its L1; the value is supplied directly from Core 2's cache, which holds X3 Registered, without indirection through the shared L2.]
30
Protocol Optimizations
Direct cache-to-cache transfer; flexible communication
[Diagram: the response to Core 1's LD X3 also carries X4 and X5, the other words of the region being read, so Core 1 ends the transfer with X3–X5 Valid and avoids further misses.]
31
Current Hardware Limitations (revisited)
With the optimizations, all seven limitations are addressed:
✔ No transient states
✔ Simple to extend for optimizations
✔ No directory storage overhead for sharer lists
✔ No invalidation and ack messages
✔ No false sharing
✔ No indirection through the directory
✔ Flexible, not cache-line, communication
32
Outline
Motivation
Background: DPJ
Base DeNovo Protocol
DeNovo Optimizations
Evaluation
– Complexity
– Performance
Conclusion and Future Work
33
Protocol Verification
DeNovo vs. MESI, both at word granularity, model-checked with Murphi
Correctness
– Six bugs found in the MESI protocol
  * Difficult to find and fix
– Three bugs found in the DeNovo protocol
  * Simple to fix
Complexity
– 15x fewer reachable states for DeNovo
– 20x difference in model-checking runtime
34
Performance Evaluation Methodology
Simulator: Simics + GEMS + Garnet
System parameters
– 64 cores
– Simple in-order core model
Workloads
– FFT, LU, Barnes-Hut, and radix from SPLASH-2
– bodytrack and fluidanimate from PARSEC 2.1
– kd-Tree (two versions) [HPG 09]
35
MESI Word (MW) vs. DeNovo Word (DW)
[Chart: memory stall time for FFT, LU, kdFalse, kdPadded, Barnes, bodytrack, fluidanimate, radix]
DW's performance is competitive with MW's
36
MESI Line (ML) vs. DeNovo Line (DL)
[Chart: memory stall time for FFT, LU, kdFalse, kdPadded, Barnes, bodytrack, fluidanimate, radix]
DL has about the same or better memory stall time than ML
DL outperforms ML significantly on apps with false sharing
37
Optimizations on DeNovo Line
[Chart: memory stall time for FFT, LU, kdFalse, kdPadded, Barnes, bodytrack, fluidanimate, radix]
Combined optimizations perform best
– Except for LU and bodytrack
– Apps with low spatial locality suffer from line-granularity allocation
38
Network Traffic
[Chart: network traffic for FFT, LU, kdFalse, kdPadded, Barnes, bodytrack, fluidanimate, radix]
DeNovo has less traffic than MESI in most cases
DeNovo incurs more write traffic
– Due to word-granularity registration
– Can be mitigated with a "write-combining" optimization (sketched below)
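A hedged sketch of such a write-combining buffer (all names and the line size are assumptions; the slides only name the idea): batch per-word registration requests that fall in the same line into a single message.

    import java.util.*;

    class WriteCombiningBuffer {
        static final int WORDS_PER_LINE = 16;              // assumed line size
        private final Map<Long, BitSet> pending = new HashMap<>();

        // Record a word's registration instead of sending it immediately
        void register(long wordAddr) {
            long line = wordAddr / WORDS_PER_LINE;
            int offset = (int) (wordAddr % WORDS_PER_LINE);
            pending.computeIfAbsent(line, k -> new BitSet()).set(offset);
        }

        // Send one registration message per line, covering all its words
        void flush() {
            pending.forEach(this::sendRegistration);
            pending.clear();
        }

        private void sendRegistration(long line, BitSet words) { /* network send */ }
    }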
39
Conclusion and Future Work
DeNovo rethinks hardware for disciplined models
Complexity
– No transient states: 20x faster to verify than MESI
– Extensible: optimizations without new states
Storage overhead
– No directory overhead
Performance and power inefficiencies
– No invalidations, acks, false sharing, or indirection
– Flexible, not cache-line, communication
– Up to 81% reduction in memory stall time
Future work: region-driven layout, off-chip memory, safe non-determinism, sync, OS, legacy
40
Thank You!