Published by Deborah Nicholson. Modified over 9 years ago.
1
RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence
Andreas Moshovos
moshovos@eecg.toronto.edu
www.eecg.toronto.edu/aenao
Moshovos ©
2
Improving Snoop Coherence
[Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]
- Conventional considerations: complexity and correctness, NOT power/bandwidth
- Can we (1) reduce power/bandwidth and (2) leverage snoop coherence?
  - Snoop coherence remains attractive: simple, design re-use
- Yes: exploit program behavior to dynamically identify requests that do not need snooping
3
RegionScout: Avoid Some Snoops
[Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]
- Frequent case: non-sharing even at a coarse level (Region)
- RegionScout: dynamically identify non-shared Regions
  - First request to a Region identifies it as not shared
  - Subsequent requests do not need to be broadcast
- Uses imprecise information
  - Small structures
  - Layer on top of conventional coherence
  - No additional constraints
4
Roadmap
- Conventional Coherence
  - The need for power-aware designs
- Potential: Program Behavior
- RegionScout: What and How
- Implementation
- Evaluation
- Summary
5
Coherence Basics
- Given a request for memory block X (address)
- Detect where its current value resides
[Figure: a CPU snoops for X; another CPU reports a hit; Main Memory shown alongside]
6
Conventional Coherence is not Power-Aware or Bandwidth-Effective
- All L2 tags see all accesses
- Performance and complexity: the L2 tags are already there, so why not use them
- Power: all L2 tags consume power on all accesses
- Bandwidth: all coherent requests are broadcast
[Figure: a CPU's L2 miss is seen by the other CPUs' L2 tags and by Main Memory]
7
RegionScout Motivation: Sharing is Coarse
- Region: large contiguous memory area, power-of-2 size
- CPU asks for data block X in region R
  1. No one else has X
  2. No one else has any block in R
- RegionScout exploits this behavior
- Layered extension over snoop coherence
[Figure: typical memory space snapshot; addresses colored by owner(s)]
8
Optimization Opportunities
- Power and Bandwidth
  - Originating node: avoid asking others
  - Remote node: avoid tag lookup
[Figure: three CPUs, each with I$ and D$, connected through a SWITCH to Memory]
9
Potential: Region Miss Frequency
[Figure: Global Region Misses as % of all requests (y-axis) vs. Region Size (x-axis); higher is better]
- Even with a 16K Region, ~45% of requests miss in all remote nodes
10
RegionScout at Work: Non-Shared Region Discovery
- First request detects a non-shared region
[Figure: numbered steps 1, 2, 2, 3 between the CPUs and Main Memory, labeled "Region Miss" and "Global Region Miss"]
- Record: Non-Shared Regions
- Record: Locally Cached Regions
11
RegionScout at Work: Avoiding Snoops
- Subsequent request avoids snoops
[Figure: numbered steps 1, 2 between the requesting CPU and Main Memory, labeled "Global Region Miss"]
- Record: Non-Shared Regions
- Record: Locally Cached Regions
12
RegionScout is Self-Correcting
- Request from another node invalidates the non-shared record
[Figure: numbered steps 1, 2, 2 between the CPUs and Main Memory]
- Record: Non-Shared Regions
- Record: Locally Cached Regions
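The three "RegionScout at Work" slides can be read as one request flow. The sketch below is a toy model of that flow, not the design from the talk: it assumes a 16K region and 64-byte blocks, keeps exact sets instead of the small hardware tables, and the names `Node`, `request`, and `region_of` are invented for illustration.

```python
REGION_SIZE = 16 * 1024   # bytes per region (assumed for illustration)
BLOCK_SIZE = 64           # bytes per cache block (assumed for illustration)


def region_of(addr):
    """Region number that a byte address falls in."""
    return addr // REGION_SIZE


class Node:
    """Toy node: exact sets stand in for the small hardware structures."""

    def __init__(self, name):
        self.name = name
        self.cached_blocks = set()   # block numbers currently cached
        self.non_shared = set()      # regions discovered to be non-shared

    def caches_region(self, region):
        return any(region_of(b * BLOCK_SIZE) == region
                   for b in self.cached_blocks)


def request(requester, addr, nodes):
    """Issue one coherent request and report what happened."""
    region = region_of(addr)
    block = addr // BLOCK_SIZE
    # Avoiding snoops: the region is already known to be non-shared.
    if region in requester.non_shared:
        requester.cached_blocks.add(block)
        return "no broadcast"
    # Otherwise broadcast; every remote node sees the request.
    shared = False
    for node in nodes:
        if node is requester:
            continue
        # Self-correction: a snooped request kills the remote record.
        node.non_shared.discard(region)
        if node.caches_region(region):
            shared = True
    requester.cached_blocks.add(block)
    if shared:
        return "broadcast, region shared"
    # Discovery: all remote nodes reported a region miss.
    requester.non_shared.add(region)
    return "broadcast, global region miss"


a, b = Node("A"), Node("B")
nodes = [a, b]
outcomes = [
    request(a, 0x0000, nodes),  # discovery
    request(a, 0x0040, nodes),  # same region: snoop avoided
    request(b, 0x0080, nodes),  # another node enters the region
    request(a, 0x00C0, nodes),  # A's record was invalidated
]
print(outcomes)
```

The second request is the case the mechanism targets: it stays local because the first request already established that no other node holds any block in the region.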
13
Implementation: Requirements
- Requesting node provides the address
- At the originating node (from CPU):
  - Have I discovered that this region is not shared?
- At remote nodes (from interconnect):
  - Do I have a block in the region?
[Figure: CPU address split into Region Tag and offset; the offset is lg(Region Size) bits wide]
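A minimal sketch of the address split described above, assuming a power-of-two region size so that the offset is exactly lg(Region Size) bits. The function name `split_address` and the sample address are illustrative only.

```python
def split_address(addr, region_size):
    """Return (region_tag, offset) for a power-of-two region size."""
    if region_size <= 0 or region_size & (region_size - 1):
        raise ValueError("region_size must be a power of two")
    shift = region_size.bit_length() - 1        # lg(region_size)
    return addr >> shift, addr & (region_size - 1)


# 16K regions: the low 14 bits are the offset, the rest is the Region Tag.
tag, offset = split_address(0x12345678, 16 * 1024)
print(hex(tag), hex(offset))   # 0x48d1 0x1678
```

Both questions on the slide (originating node and remote node) are asked about the Region Tag only; the offset never takes part in a lookup.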
14
Remembering Non-Shared Regions
Non-Shared Region Table
- Records non-shared regions
- Lookup by Region portion of the address prior to issuing a request
- Snoop requests and invalidate
- Few entries: 16x4 in most experiments
[Figure: address split into Region Tag and offset; each table entry holds a valid bit and a Region Tag]
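One way to picture the Non-Shared Region Table is as a small set-associative tag store. The sketch below reads "16x4" as 16 sets of 4 ways and uses LRU replacement and modulo indexing; all three are assumptions, since the slide gives only the size. The class and method names are hypothetical.

```python
class NonSharedRegionTable:
    """Small set-associative table of region tags believed non-shared."""

    def __init__(self, num_sets=16, ways=4, region_size=16 * 1024):
        self.num_sets = num_sets
        self.ways = ways
        self.shift = region_size.bit_length() - 1   # lg(region_size)
        # Each set is a list ordered from least to most recently used.
        self.sets = [[] for _ in range(num_sets)]

    def _locate(self, addr):
        tag = addr >> self.shift
        return tag, self.sets[tag % self.num_sets]  # modulo index (assumed)

    def is_non_shared(self, addr):
        """Lookup done before issuing a request."""
        tag, entries = self._locate(addr)
        if tag in entries:
            entries.remove(tag)
            entries.append(tag)        # refresh LRU position
            return True
        return False

    def record(self, addr):
        """Remember a region after a global region miss."""
        tag, entries = self._locate(addr)
        if tag in entries:
            entries.remove(tag)
        elif len(entries) >= self.ways:
            entries.pop(0)             # evict least recently used (assumed)
        entries.append(tag)

    def snoop_invalidate(self, addr):
        """Another node's request was snooped: drop the record."""
        tag, entries = self._locate(addr)
        if tag in entries:
            entries.remove(tag)


nsrt = NonSharedRegionTable()
nsrt.record(0x4000)
print(nsrt.is_non_shared(0x4004))   # True: same 16K region
nsrt.snoop_invalidate(0x5000)       # remote request to the same region
print(nsrt.is_non_shared(0x4000))   # False
```

Losing an entry to replacement is safe in this model: the next request to that region is simply broadcast again.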
15
What Regions are Locally Cached?
- If we had as many counters as regions:
  - Block allocation: counter[region]++
  - Block eviction: counter[region]--
  - Region cached only if counter[region] is non-zero
- Not practical:
  - E.g., 16K Regions and 4G Memory → 256K counters
[Figure: address split into Region Tag and offset; the Region Tag selects a counter]
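The exact scheme and its cost can be written out directly. This is a sketch with invented names; a Python `Counter` stands in for the per-region counter array, and the constants are the slide's 16K regions and 4G memory.

```python
from collections import Counter

REGION_SIZE = 16 * 1024          # 16K regions
MEMORY_SIZE = 4 * 1024 ** 3      # 4G memory
NUM_REGIONS = MEMORY_SIZE // REGION_SIZE   # 2**32 / 2**14 = 2**18 = 256K


class ExactRegionCounters:
    """One counter per region: precise, but too large to build."""

    def __init__(self, region_size=REGION_SIZE):
        self.region_size = region_size
        self.count = Counter()   # stands in for NUM_REGIONS counters

    def on_block_allocate(self, addr):
        self.count[addr // self.region_size] += 1

    def on_block_evict(self, addr):
        region = addr // self.region_size
        if self.count[region] == 0:
            raise ValueError("eviction from a region with no cached blocks")
        self.count[region] -= 1

    def is_region_cached(self, addr):
        return self.count[addr // self.region_size] > 0


print(NUM_REGIONS)   # 262144, i.e. 256K counters
```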
16
What Regions are Locally Cached?
Cached Region Hash
- Use few counters; imprecise:
  - Records a superset of locally cached Regions
  - False positives: lost opportunity, correctness preserved
- "Counter": + on block allocation, - on block eviction
- P-bit: 1 if counter non-zero; used for lookups
- Few entries, e.g., 256
[Figure: the Region Tag of the address is hashed to select a counter and its p bit]
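A minimal sketch of the Cached Region Hash, assuming a simple modulo hash of the region tag (the slide does not name the hash function). Class and method names are hypothetical. The usage lines show a false positive: two regions that hash to the same counter.

```python
class CachedRegionHash:
    """Few counters indexed by a hash of the region tag (imprecise)."""

    def __init__(self, entries=256, region_size=16 * 1024):
        self.counters = [0] * entries
        self.shift = region_size.bit_length() - 1   # lg(region_size)

    def _index(self, addr):
        # Modulo hash of the region tag (assumed for illustration).
        return (addr >> self.shift) % len(self.counters)

    def on_block_allocate(self, addr):
        self.counters[self._index(addr)] += 1

    def on_block_evict(self, addr):
        i = self._index(addr)
        if self.counters[i] == 0:
            raise ValueError("eviction without a matching allocation")
        self.counters[i] -= 1

    def p_bit(self, addr):
        """1 if the counter is non-zero; this is what snoop lookups read."""
        return 1 if self.counters[self._index(addr)] != 0 else 0


crh = CachedRegionHash()
crh.on_block_allocate(0x0000)            # a block in region 0 is cached
print(crh.p_bit(0x0000))                 # 1: region 0 is cached
print(crh.p_bit(256 * 16 * 1024))        # 1: region 256 collides (false positive)
print(crh.p_bit(1 * 16 * 1024))          # 0: region 1 is definitely not cached
```

A p-bit of 0 is always exact, which is why correctness is preserved; a false positive only means the requester loses the chance to record the region as non-shared.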
17
Roadmap
- Conventional Coherence
- Program Behavior: Region Miss Frequency
- RegionScout
- Evaluation
- Summary
18
Evaluation Overview
- Methodology
- Filter rates
  - Practical filters can capture many Region Misses
- Interconnect bandwidth reduction
19
Methodology
- In-house simulator based on SimpleScalar
  - Execution driven
  - All instructions simulated; MIPS-like ISA
  - System calls faked by passing them to the host OS
  - Synchronization using load-linked/store-conditional
  - Simple in-order processors
  - Memory requests complete instantaneously
  - MESI snoop coherence
  - 1- or 2-level memory hierarchy
  - WATTCH power models
- SPLASH II benchmarks
  - Scientific workloads
  - Feasibility study
20
Filter Rates
[Figure: Identified Global Region Misses (y-axis) vs. CRH Size (x-axis); higher is better]
- For a small CRH it is better to use large regions
- Practical RegionScout filters capture a lot of the potential
21
Bandwidth Reduction
[Figure: Messages (y-axis) vs. Region Size (x-axis), with a CMP series; "better" marks the preferred direction]
- Moderate bandwidth savings for SMP (15%-22%)
- More so for CMP (>25%)
22
Related Work
- RegionScout
  - Technical Report, Dec. 2003
- Jetty
  - Moshovos, Memik, Falsafi, Choudhary, HPCA 2001
- PST
  - Ekman, Dahlgren, and Stenström, ISLPED 2002
- Coarse-Grain Coherence
  - Cantin, Lipasti, and Smith, ISCA 2005
23
Summary
- Exploit program behavior / optimize a frequent case
  - Many requests result in a global region miss
- RegionScout
  - Practical filter mechanism
  - Dynamically detect would-be region misses
  - Avoid broadcasts
  - Save tag lookup power and interconnect bandwidth
  - Small structures
  - Layered extension over existing mechanisms
  - Invisible to the programmer and the OS
24
RegionScout and Directories
- Different information
  - Directory: block-level sharing
  - RegionScout: Region-level sharing
    - Could build a Region-level directory
    - This work serves as motivation
- Directories use precise information
  - RegionScout does not have to
- Directories/Implementation
- RegionScout can approximate a directory
  - If remote nodes sent sharing info as opposed to a single bit