RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence
Andreas Moshovos (© Moshovos)


1 RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence
  Andreas Moshovos, moshovos@eecg.toronto.edu, www.eecg.toronto.edu/aenao

2 Improving Snoop Coherence
  [Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]
  - Conventional Considerations: Complexity and Correctness, NOT Power/Bandwidth
  - Remains Attractive: Simple / Design Re-use
  - Can we: (1) Reduce Power/Bandwidth (2) Leverage snoop coherence?
  - Yes: Exploit Program Behavior to Dynamically Identify Requests that do not Need Snooping

3 RegionScout: Avoid Some Snoops
  [Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]
  - Frequent case: non-sharing even at a coarse level/Region
  - RegionScout: Dynamically Identify Non-Shared Regions
    - First Request to a Region Identifies it as not Shared
    - Subsequent Requests do not need to be broadcast
  - Uses Imprecise Information
    - Small structures
    - Layer on top of conventional coherence
    - No additional constraints

4 Roadmap
  - Conventional Coherence:
    - The need for power-aware designs
  - Potential: Program Behavior
  - RegionScout: What and How
  - Implementation
  - Evaluation
  - Summary

5 Coherence Basics
  - Given request for memory block X (address)
  - Detect where its current value resides
  [Figure: CPUs and Main Memory; a snoop for X hits in another CPU]

6 Conventional Coherence not Power-Aware/Bandwidth-Effective
  - All L2 tags see all accesses
  - Performance & Complexity: the L2 tags are already there, so why not use them?
  - Power: All L2 tags consume power on all accesses
  - Bandwidth: all coherent requests are broadcast
  [Figure: CPUs with L2 caches and Main Memory; an L2 miss in one CPU is seen by the others]

7 RegionScout Motivation: Sharing is Coarse
  - Region: large contiguous memory area, power-of-2 size
  - A CPU asks for data block X in region R
    1. No one else has X
    2. No one else has any block in R
  - RegionScout Exploits this Behavior
  - Layered Extension over Snoop Coherence
  [Figure: Typical Memory Space Snapshot, addresses colored by owner(s)]

8 Optimization Opportunities
  - Power and Bandwidth
    - Originating node: avoid asking others
    - Remote node: avoid tag lookup
  [Figure: three CPUs, each with I$ and D$, connected through a switch to Memory]

9 Potential: Region Miss Frequency
  [Chart: Global Region Misses, as % of all requests, versus Region Size]
  - Even with a 16K Region, ~45% of requests miss in all remote nodes

10 RegionScout at Work: Non-Shared Region Discovery
  - First request detects a non-shared region
  [Figure: CPUs and Main Memory, with steps numbered 1 to 3 and the labels "Region Miss" and "Global Region Miss". Records shown: Non-Shared Regions, Locally Cached Regions]

11 RegionScout at Work: Avoiding Snoops
  - Subsequent request avoids snoops
  [Figure: CPUs and Main Memory, with steps numbered 1 and 2 and the label "Global Region Miss". Records shown: Non-Shared Regions, Locally Cached Regions]

12 RegionScout is Self-Correcting
  - Request from another node invalidates non-shared record
  [Figure: CPUs and Main Memory, with steps numbered 1 and 2. Records shown: Non-Shared Regions, Locally Cached Regions]
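
The three "RegionScout at Work" slides (discovery, avoiding snoops, self-correction) can be tied together in a toy model. This is only a sketch under stated assumptions, not the talk's implementation: regions are assumed to be 16 KB, each node tracks its locally cached regions with exact per-region counts, and the non-shared record is an unbounded set. The names `Node`, `System`, `miss`, and `region_of` are hypothetical, and evictions and data transfer are left out.

```python
from collections import Counter

REGION_OFFSET_BITS = 14  # assumption: 16 KB regions


def region_of(addr):
    """Region number of an address (the address with its offset bits dropped)."""
    return addr >> REGION_OFFSET_BITS


class Node:
    def __init__(self, name):
        self.name = name
        self.cached = Counter()   # blocks held per region (locally cached regions)
        self.non_shared = set()   # regions discovered to be non-shared

    def snoop(self, addr):
        """A remote request arrives from the interconnect; True means region hit."""
        region = region_of(addr)
        self.non_shared.discard(region)   # self-correction: someone else wants it
        return self.cached[region] > 0


class System:
    def __init__(self, names):
        self.nodes = [Node(n) for n in names]
        self.broadcasts = 0
        self.avoided = 0

    def miss(self, requester, addr):
        """Handle a cache miss at `requester`; returns 'memory' or 'broadcast'."""
        region = region_of(addr)
        if region in requester.non_shared:
            # Subsequent request to a known non-shared region: skip the snoop.
            self.avoided += 1
            path = "memory"
        else:
            self.broadcasts += 1
            hits = [n.snoop(addr) for n in self.nodes if n is not requester]
            if not any(hits):
                # Global region miss: remember that nobody else caches this region.
                requester.non_shared.add(region)
            path = "broadcast"
        requester.cached[region] += 1
        return path


system = System(["A", "B"])
a, b = system.nodes
first = system.miss(a, 0x4000)    # discovery: broadcast, global region miss
second = system.miss(a, 0x4040)   # same region: goes straight to memory
third = system.miss(b, 0x4080)    # B's broadcast invalidates A's record
fourth = system.miss(a, 0x40C0)   # A must broadcast again
```

In this run only the second request avoids the broadcast; once node B touches the region, node A's record is dropped and both nodes fall back to ordinary snooping for that region.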

13 Implementation: Requirements
  - Requesting Node provides address:
    [Figure: CPU address split into Region Tag and offset; the offset is lg(Region Size) bits wide]
  - At Originating Node, from CPU:
    - Have I discovered that this region is not shared?
  - At Remote Nodes, from Interconnect:
    - Do I have a block in the region?
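
The address split described on this slide can be sketched as follows. The 16 KB region size is an assumption taken from the sizes mentioned elsewhere in the talk, and the function names are illustrative only.

```python
from typing import Tuple

REGION_SIZE = 16 * 1024                             # assumption: 16 KB regions
REGION_OFFSET_BITS = REGION_SIZE.bit_length() - 1   # lg(Region Size) for a power of 2


def split_address(addr: int) -> Tuple[int, int]:
    """Split a CPU address into (region tag, offset within the region)."""
    region_tag = addr >> REGION_OFFSET_BITS
    offset = addr & (REGION_SIZE - 1)
    return region_tag, offset


def same_region(a: int, b: int) -> bool:
    """True if two addresses fall in the same region."""
    return (a >> REGION_OFFSET_BITS) == (b >> REGION_OFFSET_BITS)


tag, offset = split_address(0x12345678)
```

Both questions on the slide (at the originating node and at the remote nodes) are asked about the region tag only; the offset bits are ignored.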

14 Remembering Non-Shared Regions
  - Records non-shared regions
  - Lookup by Region portion prior to issuing a request
  - Snoop requests and invalidate
  [Figure: address split into Region Tag and offset; each Non-Shared Region Table entry holds a valid bit and a Region Tag]
  - Non-Shared Region Table: few entries, 16x4 in most experiments
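
A minimal sketch of such a table, under several assumptions that the slide does not confirm: "16x4" is read as 16 sets of 4 ways, replacement is LRU, regions are 16 KB, and the set index is the low-order bits of the region tag. The class and method names are hypothetical.

```python
from collections import OrderedDict

REGION_OFFSET_BITS = 14  # assumption: 16 KB regions


class NonSharedRegionTable:
    """Small set-associative table of region tags believed to be non-shared."""

    def __init__(self, num_sets=16, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        # One OrderedDict per set; key order tracks recency (LRU is an assumption).
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _locate(self, addr):
        tag = addr >> REGION_OFFSET_BITS
        return tag, self.sets[tag % self.num_sets]

    def is_non_shared(self, addr):
        """Lookup before issuing a request; True means the broadcast can be skipped."""
        tag, entries = self._locate(addr)
        if tag in entries:
            entries.move_to_end(tag)
            return True
        return False

    def record(self, addr):
        """Remember a region after a broadcast found no remote copy of it."""
        tag, entries = self._locate(addr)
        entries[tag] = True
        entries.move_to_end(tag)
        if len(entries) > self.ways:
            entries.popitem(last=False)   # evict the least recently used entry

    def snoop(self, addr):
        """A remote node requested a block in this region: invalidate the entry."""
        tag, entries = self._locate(addr)
        entries.pop(tag, None)
```

Evicting or invalidating an entry is always safe in this sketch: the next request to that region is simply broadcast as in conventional snooping.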

15 What Regions are Locally Cached?
  - If we had as many counters as regions:
    - Block Allocation: counter[region]++
    - Block Eviction: counter[region]--
    - Region cached only if counter[region] non-zero
  - Not Practical:
    - E.g., 16K Regions and 4G Memory → 256K counters
  [Figure: address split into Region Tag and offset; the Region Tag selects a counter]
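
The counter arithmetic and the exact (impractical) scheme can be sketched as below. The memory and region sizes are the ones on the slide; the function names are illustrative, and a `Counter` stands in for what would be a 256K-entry hardware array.

```python
from collections import Counter

MEMORY_BYTES = 4 * 2**30    # 4G memory, from the slide
REGION_BYTES = 16 * 2**10   # 16K regions, from the slide
NUM_REGIONS = MEMORY_BYTES // REGION_BYTES   # 2**32 / 2**14 = 2**18 = 256K counters

region_count = Counter()    # one exact counter per region


def on_block_allocate(addr):
    region_count[addr // REGION_BYTES] += 1


def on_block_evict(addr):
    region_count[addr // REGION_BYTES] -= 1


def region_is_cached(addr):
    """A region is locally cached only while its counter is non-zero."""
    return region_count[addr // REGION_BYTES] != 0
```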

16 What Regions are Locally Cached?
  - Use few counters; the information is imprecise:
    - Records a superset of locally cached Regions
    - False positives: lost opportunity, correctness preserved
  [Figure: address split into Region Tag and offset; the Region Tag is hashed to select a counter and its p-bit]
  - Cached Region Hash: few entries, e.g., 256
    - "Counter": + on block allocation, - on block eviction
    - P-bit: 1 if counter non-zero; used for lookups
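
A rough sketch of the Cached Region Hash idea follows. The hash function is not given on the slide, so using the low-order bits of the region tag is an assumption, as are the 16 KB region size and all names.

```python
REGION_OFFSET_BITS = 14  # assumption: 16 KB regions


class CachedRegionHash:
    """A few hashed counters recording a superset of the locally cached regions."""

    def __init__(self, num_entries=256):
        self.num_entries = num_entries
        self.counters = [0] * num_entries

    def _index(self, addr):
        # Assumed hash: region tag modulo the number of entries.
        return (addr >> REGION_OFFSET_BITS) % self.num_entries

    def on_block_allocate(self, addr):
        self.counters[self._index(addr)] += 1

    def on_block_evict(self, addr):
        self.counters[self._index(addr)] -= 1

    def may_have_region(self, addr):
        """P-bit lookup: True if the counter this region hashes to is non-zero."""
        return self.counters[self._index(addr)] != 0


crh = CachedRegionHash(num_entries=4)
crh.on_block_allocate(0x0100)                      # a block in region 0
true_hit = crh.may_have_region(0x0200)             # region 0 is really cached
false_positive = crh.may_have_region(4 << 14)      # region 4 aliases with region 0
clean_miss = crh.may_have_region(1 << 14)          # region 1 maps elsewhere
```

A False answer is always exact, because every cached block incremented its region's counter. A True answer may come from aliasing, which only costs a missed filtering opportunity.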

17 Roadmap
  - Conventional Coherence
  - Program Behavior: Region Miss Frequency
  - RegionScout
  - Evaluation
  - Summary

18 Evaluation Overview
  - Methodology
  - Filter rates
    - Practical Filters can capture many Region Misses
  - Interconnect bandwidth reduction

19 Methodology
  - In-House simulator based on SimpleScalar
    - Execution driven
    - All instructions simulated, MIPS-like ISA
    - System calls faked by passing them to host OS
    - Synchronization using load-linked/store-conditional
    - Simple in-order processors
    - Memory requests complete instantaneously
    - MESI snoop coherence
    - 1 or 2 level memory hierarchy
    - WATTCH power models
  - SPLASH II benchmarks
    - Scientific workloads
    - Feasibility study

20 Filter Rates
  [Chart: Identified Global Region Misses versus CRH Size]
  - For small CRH, better to use large regions
  - Practical RegionScout filters capture a lot of the potential

21 Bandwidth Reduction
  [Chart: Messages versus Region Size; panel label: CMP]
  - Moderate Bandwidth Savings for SMP (15%-22%)
  - More so for CMP (>25%)

22 Related Work
  - RegionScout
    - Technical Report, Dec. 2003
  - Jetty
    - Moshovos, Memik, Falsafi, Choudhary, HPCA 2001
  - PST
    - Ekman, Dahlgren, and Stenström, ISLPED 2002
  - Coarse-Grain Coherence
    - Cantin, Lipasti, and Smith, ISCA 2005

23 Summary
  - Exploit program behavior/optimize a frequent case
    - Many requests result in a global region miss
  - RegionScout
    - Practical filter mechanism
    - Dynamically detect would-be region misses
    - Avoid broadcasts
    - Save tag lookup power and interconnect bandwidth
    - Small structures
    - Layered extension over existing mechanisms
    - Invisible to programmer and the OS

24 RegionScout and Directories
  - Different information
    - Directory: block-level sharing
    - RegionScout: Region-level sharing
      - Could build Region-level directory
      - This work serves as motivation
  - Directories use precise information
    - RegionScout does not have to
  - Directories/Implementation
  - RegionScout can approximate a directory
    - If remote nodes sent sharing info as opposed to a single bit

