DeNovo: Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism
Byn Choi, Nima Honarmand, Rakesh Komuravelli, Robert Smolinski, Hyojin Sung, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter†, Ching-Tsun Chou†
University of Illinois at Urbana-Champaign, †Intel

Motivation
Goal: power-, complexity-, and performance-scalable hardware
Today: shared memory
– Directory-based coherence: complex and unscalable
– Difficult programming model: data races, non-determinism, no safety/composability/modularity
– Mismatched interface between HW and SW, a.k.a. the memory model
  Can't specify "what value a read can return"
  Data races defy acceptable semantics
Fundamentally broken for hardware and software

Solution? Banish shared memory?
"As you scale the number of cores on a cache-coherent system (CC), 'cost' in 'time and memory' grows to a point beyond which the additional cores are not useful in a single parallel program. This is the coherency wall."
– A case for message-passing for many-core computing, Timothy Mattson et al.

Solution
The problems are not inherent to the shared-memory paradigm.
Shared memory = global address space + implicit, unstructured communication and synchronization

Solution
Banish wild shared memory!
Wild shared memory = global address space + implicit, unstructured communication and synchronization

Solution
Build disciplined shared memory!
Disciplined shared memory = global address space + explicit, structured side-effects

explicit effects + structured parallel control ⇒
– No data races, determinism-by-default, safe non-determinism
– Simple semantics, safety, and composability
– Simple coherence and consistency
– Software-aware address/communication/coherence granularity
– Power-, complexity-, performance-scalable hardware

Research Strategy: Software Scope
Many systems provide disciplined shared-memory features
Current driver is DPJ (Deterministic Parallel Java)
– Determinism-by-default, safe non-determinism
End goal is a language-oblivious interface
Current focus on deterministic codes
– The common and best case
Extend later to safe non-determinism and legacy codes

Research Strategy: Hardware Scope
Today's focus: cache coherence
Limitations of current directory-based protocols:
1. Complexity
– Subtle races and numerous transient states in the protocol
– Hard to extend for optimizations
2. Storage overhead
– Directory overhead for sharer lists
3. Performance and power inefficiencies
– Invalidation and ack messages
– False sharing
– Indirection through the directory
– Suboptimal communication granularity (fixed cache-line size)
Ongoing: rethink the communication architecture and data layout

Contributions
Simplicity
– Compared protocol complexity with MESI
– 25x fewer reachable states under model checking
Extensibility
– Direct cache-to-cache transfer
– Flexible communication granularity
Storage overhead
– No storage overhead for directory information
– Storage overhead beats MESI beyond tens of cores and keeps scaling
Performance/Power
– Up to 73% reduction in memory stall time
– Up to 70% reduction in network traffic

Outline
Motivation
Background: DPJ
DeNovo Protocol
DeNovo Extensions
Evaluation
– Protocol Verification
– Performance
Conclusion and Future Work

Background: DPJ
Extension for modern OO languages
Structured parallel control: nested fork-join style
– foreach, cobegin
Type and effect system ensures non-interference
– Region names partition the heap
– Method effect summaries declare which regions are read/written
– Type checking ensures race freedom for parallel tasks ⇒ deterministic execution (see the Example Run slide below)
Recent support for safe non-determinism

Memory Consistency Model
Guaranteed determinism ⇒ a read returns the value of the last write in sequential order, which is either
1. by the same task in this parallel phase, or
2. from before this parallel phase
[Diagram: a ST 0xa before/within a parallel phase and a LD 0xa that must return its value; delivering that value is the job of the coherence mechanism.]
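Stated a bit more formally, as an illustrative paraphrase of the guarantee above (this notation is not from the paper): for a read r by a task T in parallel phase p,

    \mathit{val}(r) \;=\; \mathit{val}\Big(\max_{\text{seq.\ order}}\big\{\, w \;\big|\; \mathit{addr}(w)=\mathit{addr}(r),\ w \in \mathit{SameTask}(r,p)\,\cup\,\mathit{Pre}(p) \,\big\}\Big)

where SameTask(r, p) is the set of writes by T during p, and Pre(p) is the set of writes ordered before p began.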

Cache Coherence
Coherence enforcement requires:
1. Invalidating stale copies in caches
2. Tracking the up-to-date copy
Explicit effects
– The compiler knows all regions written in this parallel phase
– So a cache can self-invalidate before the next parallel phase
– It invalidates data in written regions that it did not access itself
Registration
– The directory keeps track of one up-to-date copy
– A writer updates the directory (registers) before the next parallel phase

Basic DeNovo Coherence
Assume (for now): private L1, shared L2; single-word lines
– Data-race freedom at word granularity
L2 data arrays double as a directory
– Keep valid data or a registered core ID; no space overhead
L1/L2 states: Invalid, Valid, Registered
"Touched" bit – set only if the word is read in the current phase
[State diagram: Invalid →(Read) Valid →(Write) Registered; a write registers the word in the L2 registry]
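To make the three states concrete, here is a minimal C++ sketch of the per-word L1 behavior. It is added for illustration only: the names (WordState, selfInvalidate, regionWritten) are ours, not from the DeNovo implementation, and the real protocol runs in hardware alongside the L2 registry.

    #include <array>
    #include <cstdint>

    // Illustrative sketch of DeNovo's per-word L1 states (not the actual
    // DeNovo/GEMS code). One entry per word in the private L1.
    enum class WordState : uint8_t { Invalid, Valid, Registered };

    struct L1Word {
        WordState state   = WordState::Invalid;
        uint8_t   region  = 0;      // software-supplied region ID for this word
        bool      touched = false;  // set only if read in the current phase
        uint32_t  data    = 0;
    };

    struct L1Cache {
        static constexpr int kWords = 1024;   // toy capacity
        std::array<L1Word, kWords> words;

        // A read makes the word Valid (fetching the up-to-date copy on a
        // miss) and marks it touched for this phase.
        uint32_t read(int i) {
            if (words[i].state == WordState::Invalid) {
                /* fetch from L2, or from the registered core it names */
                words[i].state = WordState::Valid;
            }
            words[i].touched = true;
            return words[i].data;
        }

        // A write makes the word Registered; the L2 records this core's ID
        // in its data array so it can always find the latest copy.
        void write(int i, uint32_t v) {
            words[i].data  = v;
            words[i].state = WordState::Registered;
            /* send registration request to L2; ack arrives before phase end */
        }

        // At a parallel-phase boundary, self-invalidate: drop Valid words in
        // regions written during the phase unless this core touched them
        // itself. Registered words are this core's own writes and stay.
        // No invalidation or ack traffic is needed.
        void selfInvalidate(bool (*regionWritten)(uint8_t)) {
            for (auto& w : words) {
                if (w.state == WordState::Valid && regionWritten(w.region) && !w.touched)
                    w.state = WordState::Invalid;
                w.touched = false;  // reset for the next phase
            }
        }
    };

At each phase boundary, compiler-inserted code supplies the set of regions written during the phase (the "explicit effects" from the previous slide), which is what regionWritten stands in for here.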

Example Run

    class S_type {
        X in DeNovo-region …;
        Y in DeNovo-region …;
    }
    S_type S[size];
    ...
    Phase1 writes … { // DeNovo effect
        foreach i in 0, size {
            S[i].X = …;
        }
        self_invalidate(…);
    }

[Animation: the L1s of Core 1 and Core 2 and the shared L2, with words X0–X5 and Y0–Y5 all initially Valid. Core 1 writes X0–X2 and Core 2 writes X3–X5; each written word becomes Registered in the writer's L1, and on the registration ack the corresponding L2 word becomes Registered with the writer's core ID (C1 or C2) stored in the data array. At the end of the phase, self-invalidation turns the X words a core did not write Invalid, while its own X words stay Registered and all Y words stay Valid, since region Y was not written this phase.]

Addressing Limitations
Limitations of current directory-based protocols:
1. Complexity ✔
– Subtle races and numerous transient states in the protocol
– Hard to extend for optimizations
2. Storage overhead ✔
– Directory overhead for sharer lists
3. Performance and power overhead
– Invalidation and ack messages ✔
– False sharing
– Indirection through the directory

Practical DeNovo Coherence
The basic protocol is impractical
– High tag storage overhead (a tag per word)
Address/transfer granularity > coherence granularity
DeNovo line-based protocol
– Traditional software-oblivious spatial locality
– Coherence granularity stays at word level ⇒ no word-level false sharing
[Diagram: "line merging" – an incoming line copy with per-word states V V V merges into a cached line with states V V R, word by word.]
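A hedged sketch of what line merging could look like in code, assuming the line response carries per-word valid bits; this is illustrative only, not the DeNovo hardware:

    #include <array>
    #include <cstdint>

    // "Line merging": address/transfer granularity is a 16-word line, but
    // coherence state stays per-word, so an incoming line fills only the
    // words that are locally Invalid. Locally Valid or Registered words keep
    // their data -- hence no word-level false sharing.
    enum class WordState : uint8_t { Invalid, Valid, Registered };

    constexpr int kWordsPerLine = 16;  // 64-byte line, 4-byte words

    struct LineFrame {
        std::array<WordState, kWordsPerLine> state{};
        std::array<uint32_t,  kWordsPerLine> data{};
    };

    struct LineResponse {
        std::array<bool,     kWordsPerLine> valid{};  // words the responder had
        std::array<uint32_t, kWordsPerLine> data{};
    };

    void mergeLine(LineFrame& frame, const LineResponse& resp) {
        for (int w = 0; w < kWordsPerLine; ++w) {
            if (frame.state[w] == WordState::Invalid && resp.valid[w]) {
                frame.data[w]  = resp.data[w];
                frame.state[w] = WordState::Valid;
            }
            // Words already Valid or Registered locally are left untouched.
        }
    }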

Storage Overhead
DeNovo line (64 bytes/line, 16 words):
– L1: 2 state bits + 1 touched bit + 5 region bits per word = 128 bits/line
– L2: 1 valid bit/line + 1 dirty bit/line + 1 state bit/word = 18 bits/line
MESI (full-map, in-cache directory):
– L1: 5 state bits = 5 bits/line
– L2: 5 state bits + [# of cores] sharer-list bits = (5 + P) bits/line
Assumption: size of L2 = 8 x (aggregate size of L1s)
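These per-line counts pin down a crossover point. The small C++ check below is ours, using only the numbers stated on this slide (plus the 16 words/line implied by 64-byte lines and 4-byte words); it reproduces the "29 cores" crossover on the next slide.

    #include <cstdio>

    // Back-of-the-envelope comparison of coherence-state storage, normalized
    // per aggregate-L1 line, with the L2 holding 8x as many lines.
    int main() {
        const double kL2toL1 = 8.0;               // L2 lines per aggregate L1 line

        const double denovoL1 = (2 + 1 + 5) * 16; // 128 bits/line
        const double denovoL2 = 1 + 1 + 16;       //  18 bits/line
        const double mesiL1   = 5;                //   5 bits/line

        for (int cores = 2; cores <= 128; cores *= 2) {
            const double mesiL2 = 5 + cores;      // (5 + P) bits/line
            const double denovo = denovoL1 + kL2toL1 * denovoL2;  // 272, constant
            const double mesi   = mesiL1   + kL2toL1 * mesiL2;    // 45 + 8P
            std::printf("%3d cores: DeNovo %4.0f bits vs MESI %4.0f bits%s\n",
                        cores, denovo, mesi,
                        denovo < mesi ? "  (DeNovo smaller)" : "");
        }
        return 0;
    }

DeNovo's cost is constant (272 bits per aggregate-L1 line) while MESI's grows as 45 + 8P, so the curves cross between 28 and 29 cores.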

Storage Overhead
[Chart: coherence-state storage vs. core count; the DeNovo and MESI curves cross at 29 cores, beyond which DeNovo's overhead is lower.]

Alternative Directory Designs
Duplicate-tag directory
– L1 tags duplicated in the directory
– Requires highly associative lookups: a 64-core setup with 4-way L1s needs a 256-way (64 x 4) associative lookup
Sparse directory
– Significant performance overhead at higher core counts
Tagless directory (MICRO '09)
– Imprecise sharer-list encoding using Bloom filters
– Even more protocol complexity (false positives)
– Trades performance for lower storage overhead/power

Addressing Limitations
Limitations of current directory-based protocols:
1. Complexity ✔
– Subtle races and numerous transient states in the protocol
– Hard to extend for optimizations
2. Storage overhead ✔
– Directory overhead for sharer lists
3. Performance and power overhead
– Invalidation and ack messages ✔
– False sharing ✔
– Indirection through the directory

Extensions
Traditional directory-based protocols
⇒ sharer lists must always contain all the true sharers
DeNovo protocol
⇒ the registry only needs to point to the latest copy at the end of a phase
⇒ valid data can be copied around freely

Extensions (1 of 2)
Basic with direct cache-to-cache transfer
– Get data directly from the producer
– Through prediction and/or software assistance
– Converts 3-hop misses into 2-hop misses
[Animation: Core 1 issues LD X3; X3 is Registered in Core 2's L1, and the data is supplied directly from Core 2's cache rather than by indirecting through the shared L2 registry.]
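A hedged sketch of the direct-transfer idea; predictLastWriter and probeCoreL1 are illustrative placeholders (the slide only says "prediction and/or software assistance"), not a real predictor design:

    #include <cstdint>

    // Illustrative stubs: some last-writer predictor and a direct L1 probe.
    int  predictLastWriter(uint64_t addr) { return static_cast<int>(addr) % 4; }
    bool probeCoreL1(int core, uint64_t addr) { (void)core; (void)addr; return false; }
    bool fetchThroughRegistry(uint64_t addr) { (void)addr; return true; }  // 3-hop path

    bool load(uint64_t addr) {
        int owner = predictLastWriter(addr);
        if (owner >= 0 && probeCoreL1(owner, addr))
            return true;                    // 2 hops: requester -> producer -> back
        return fetchThroughRegistry(addr);  // requester -> L2 -> owner -> back
    }

The enabler is the previous slide: since any Valid copy is safe to read and the registry only has to be right at phase boundaries, a mispredicted probe needs no protocol-level recovery, just a fallback to the registry path.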

Extensions (2 of 2)
Basic with flexible communication granularity
– Software-directed data transfer
– Transfer "relevant" data together
– Achieves the effect of an AoS-to-SoA transformation without programmer/compiler intervention
[Animation: Core 1 issues LD X3; instead of a fixed cache line that would also carry the adjacent Y and Z fields, the response gathers the related words X3–X5 and returns them together, and Core 1 caches them as Valid.]

Outline
Motivation
Background: DPJ
DeNovo Protocol
DeNovo Extensions
Evaluation
– Protocol Verification
– Performance
Conclusion and Future Work

Evaluation
Simplicity
– Formal verification of the cache coherence protocol
– Comparing reachable states
Performance/Power
– Simulation experiments
Extensibility
– DeNovo extensions

Protocol Verification Methodology
Murphi model-checking tool
Verified the DeNovo-word and MESI-word protocols
– State-of-the-art GEMS implementation of MESI
Abstract model
– Single address/region, two data values
– Two cores with private L1s and a unified L2, unordered network
– Data-race-freedom guarantee exploited for DeNovo
– Cross-phase interactions modeled

Protocol Verification: Results
Correctness
– Three bugs found in the DeNovo protocol
  Mistakes in translation from the high-level specification; simple to fix
– Six bugs found in the MESI protocol
  Two deadlock scenarios and unhandled races due to L1 writebacks; took several days to fix
Complexity
– 25x fewer reachable states for DeNovo
– 30x difference in model-checking runtime

Performance Evaluation Methodology
Simulation environment
– Wisconsin GEMS + Simics + Princeton Garnet network
System parameters
– 64 cores
– Private L1s (128 KB) and a unified L2 (32 MB)
Simple core model
– 5-stage, single-issue, in-order core
– Results report memory stall time only

Benchmarks
FFT and LU from SPLASH-2
kdTree
– Construction of k-D trees for ray tracing
– Developed within UPCRC; presented at HPG 2010
– Two versions:
  kdFalse: false sharing in an auxiliary structure
  kdPad: padding to eliminate the false sharing

Simulated Protocols
Word-based (4 B cache line)
– MESI and DeNovo
Line-based (64 B cache line)
– MESI and DeNovo
DeNovo extensions (proof of concept, word-based)
– DeNovo direct
– DeNovo flex

Word Protocols
Dword is comparable to Mword: simplicity doesn't compromise performance.
[Chart: memory stall time for FFT, LU, kdFalse, and kdPad under the word-based protocols.]

DeNovo Direct
Ddirect reduces remote L1 hit time.
[Chart: memory stall time for FFT, LU, kdFalse, and kdPad.]

Line Protocols
Dline is not susceptible to false sharing: up to 66% reduction in total time.
[Chart: memory stall time for FFT, LU, kdFalse, and kdPad under the line-based protocols.]

Word vs. Line
The benefit of lines is application dependent.
kdFalse: Mline is worse by 12% (false sharing).
[Chart: memory stall time for FFT, LU, kdFalse, and kdPad, word vs. line protocols.]

DeNovo Flex
Dflex outperforms all other configurations: up to 73% reduction over Mline.
[Chart: memory stall time for FFT, LU, kdFalse, and kdPad.]

Network Traffic
Dline and Dflex generate less network traffic than Mline: up to 70% reduction.
[Chart: network traffic for FFT, LU, kdFalse, and kdPad.]

Conclusion
Disciplined programming models are key for software
– DeNovo rethinks the hardware for disciplined models
Simplicity
– 25x fewer reachable states under model checking
– 30x difference in model-checking runtime
Extensibility
– Direct cache-to-cache transfer
– Flexible communication granularity
Storage overhead
– No storage overhead for directory information
– Storage overhead beats MESI beyond tens of cores
Performance/Power
– Up to 73% less time spent on memory stalls
– Up to 70% reduction in network traffic

Future Work
Rethinking cache data layout
Extend to:
– Disciplined non-deterministic codes
– Synchronization
– Legacy codes
Extend to off-chip memory
Automate generation of hardware regions
More extensive evaluations

Thank You!

Backup Slides

Flexible Coherence Granularity
Byte-level sharing is uncommon
– None of the apps we have studied so far
Handle it correctly but not necessarily efficiently
– The compiler aligns byte-granularity regions at word boundaries
– If that fails, hardware "clones" the line into 4 cache frames
  With at least 4-way associativity, all clones reside in the same set
[Table: a normal frame holds Tag | State | Byte 0 | Byte 1 | Byte 2 | Byte 3; a cloned line occupies four frames of one set, each holding Tag | State | Byte b for a single byte lane b.]
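A hedged sketch of how a cloned lookup might be organized; the structures and the one-lane-per-frame mapping are illustrative assumptions, not the actual hardware design:

    #include <array>
    #include <cstdint>

    // Cloning for byte-granularity regions: the line is replicated into the
    // 4 ways of one set, and clone frame b keeps coherence state only for
    // byte lane b of each word, so different cores can register different
    // bytes of the same word without conflicting.
    enum class State : uint8_t { Invalid, Valid, Registered };

    struct Frame {
        uint64_t tag    = 0;
        bool     cloned = false;
        uint8_t  lane   = 0;            // which byte lane this clone owns
        std::array<State, 16>  state{}; // per-word state (for its lane)
        std::array<uint8_t, 64> bytes{};
    };

    // Look up a byte: in a cloned set, hit only in the frame owning its lane.
    Frame* lookupByte(std::array<Frame, 4>& set, uint64_t tag, int byteInWord) {
        for (auto& f : set)
            if (f.tag == tag && (!f.cloned || f.lane == byteInWord))
                return &f;
        return nullptr;
    }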