CACM July 2012 Talk: Mark D. Hill, Wisconsin @ Cornell University, 10/2012
Executive Summary
Today's chips provide shared memory w/ HW coherence as low-level support for OS & application SW. As #cores per chip scales:
o Some argue HW coherence must go due to growing overheads
o We argue it stays, by managing overheads
We develop a scalable on-chip coherence proof-of-concept:
o Inclusive caches first
o Exact tracking of sharers & replacements (key to analysis)
o Larger systems need to use hierarchy (clusters)
o Overheads similar to today's
Compatibility of on-chip HW coherence is here to stay. Let's spend programmer sanity on parallelism, not lost compatibility!
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication
2. Storage
3. Enforcing Inclusion
4. Latency
5. Energy
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
Academics Criticize HW Coherence
Choi et al. [DeNovo]:
o "Directory… coherence… extremely complex & inefficient…. Directory… incurring significant storage and invalidation traffic overhead."
Kelm et al. [Cohesion]:
o "A software-managed coherence protocol… avoids… directories and duplicate tags, & implementing & verifying… less traffic…"
Industry Eschews HW Coherence
Intel 48-Core IA-32 Message-Passing Processor: "…SW protocols… to eliminate the communication & HW overhead"
IBM Cell processor: "…the greatest opportunities for increased application performance, is the existence of the local store memory and the fact that software must manage this memory"
BUT…
[Figure slide] Source: Avinash Sodani, "Race to Exascale: Challenges and Opportunities," MICRO 2011.
Define "Coherence as Scalable"
Define a coherent system as scalable when the cost of providing coherence grows (at most) slowly as core count increases.
Our Focus:
o YES: coherence
o NO: any scalable system also requires scalable HW (interconnects, memories) and SW (OS, middleware, apps)
Method:
o Identify each overhead & show it can grow slowly
Expect more cores:
o Moore's Law provides more transistors
o Power-efficiency improvements (w/o Dennard Scaling)
o Experts disagree on how many cores are possible
Caches & Coherence
Cache: fast, hidden memory, used to reduce
o Latency: average memory access time
o Bandwidth: interconnect traffic
o Energy: cache misses cost more energy
Caches are hidden (from software):
o Naturally, for a single-core system
o Via a coherence protocol, for a multicore
Maintain the coherence invariant: for a given (memory) block at a given time, either
o Modified (M): a single core can read & write, or
o Shared (S): zero or more cores can read, but not write
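To make the invariant concrete, here is a minimal sketch of the single-writer / multiple-reader check it implies; the function and names are ours for illustration, not from the talk:

```python
# Minimal sketch of the single-writer / multiple-reader (SWMR) coherence
# invariant. Illustrative only; the names here are ours, not the paper's.

def swmr_holds(readers, writers):
    """For one block at one instant: either a single core may read & write
    (Modified), or zero or more cores may read but none write (Shared)."""
    if writers:                      # Modified-like epoch
        return len(writers) == 1 and readers <= writers
    return True                      # Shared-like epoch: readers only

# A write epoch: core 0 holds the block in M.
assert swmr_holds(readers={0}, writers={0})
# A read epoch: cores 1 and 2 hold the block in S.
assert swmr_holds(readers={1, 2}, writers=set())
# Violation: two simultaneous writers.
assert not swmr_holds(readers=set(), writers={0, 3})
```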
Baseline Multicore Chip
Intel Core i7-like: C = 16 cores (not 8), private L1/L2 caches, shared last-level cache (LLC), 64B blocks w/ ~8B tags.
[Figure: C cores, each with a private cache (per block: ~2-bit state, ~64-bit tag, ~512-bit data), connected by an interconnection network to a shared cache whose blocks add ~C tracking bits.]
HW coherence pervasive in general-purpose multicore chips: AMD, ARM, IBM, Intel, Sun (Oracle)
Baseline Chip Coherence
2B of tracking bits per 64+8B L2 block to track L1 copies.
Inclusive L2 (w/ recall messages on LLC evictions).
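As a quick check of the baseline numbers, the tracking overhead works out as follows (a sketch of the slide's arithmetic):

```python
# Tracking-bit overhead for the 16-core baseline (sketch of the slide's
# arithmetic): 1 tracking bit per core per shared-cache block.
cores = 16
tracking_bytes = cores / 8          # 16 bits = 2 bytes per LLC block
block_bytes = 64 + 8                # 64B data + ~8B tag/state
overhead = tracking_bytes / block_bytes
print(f"{overhead:.1%}")            # ~2.8%, the slides' "~3%"
```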
Coherence Example Setup
Block A in no private caches: state Invalid (I); shared-cache entry A: {0000}, I.
Block B in no private caches: state Invalid (I); shared-cache entry B: {0000}, I.
[Figure: Cores 0-3 with private caches; shared cache Banks 0-3 hold the entries for A and B.]
Coherence Example 1/4
Core 0 writes A: block A is now at Core 0 with exclusive read-write permission, Modified (M).
Shared-cache entry A: {1000}, M (the bit vector marks Core 0 as the holder).
Coherence Example 2/4
Cores 1 and 2 read B: block B is now at Cores 1+2, shared read-only, Shared (S).
Shared-cache entry B: {0110}, S.
Coherence Example 3/4
Core 3 writes A: block A moves from Core 0 to Core 3 (still M).
Shared-cache entry A: {0001}, M.
Coherence Example 4/4
Core 1 writes B: block B moves from Cores 1+2 (S) to Core 1 alone (M); Core 2's copy is invalidated.
Shared-cache entry B: {0100}, M.
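The four steps above follow the standard directory pattern. Below is a compact, illustrative simulator of such a directory; this is our own simplification, not the paper's protocol, and the class and method names are invented:

```python
# Illustrative directory sketch reproducing the example's four steps.
# Our own simplification, not the paper's protocol; names are invented.

class Directory:
    def __init__(self, num_cores):
        self.entries = {}   # block -> (state, set of sharer core IDs)
        self.n = num_cores

    def read(self, core, block):
        state, sharers = self.entries.get(block, ("I", set()))
        # A read downgrades any writer; the block becomes Shared.
        self.entries[block] = ("S", sharers | {core})

    def write(self, core, block):
        state, sharers = self.entries.get(block, ("I", set()))
        # Invalidate all other copies (the "C invalidations + C acks").
        self.entries[block] = ("M", {core})

d = Directory(4)
d.write(0, "A")                  # 1/4: A Modified at Core 0 -> {1000} M
d.read(1, "B"); d.read(2, "B")   # 2/4: B Shared at Cores 1+2 -> {0110} S
d.write(3, "A")                  # 3/4: A moves to Core 3     -> {0001} M
d.write(1, "B")                  # 4/4: B Modified at Core 1  -> {0100} M
print(d.entries)
```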
Caches & Coherence [figure slide]
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
Criticisms & Summary
1. Communication: (a) No Sharing, Dirty
o W/o coherence: Request + Data + Data (writeback)
o W/ coherence: Request + Data + Data (writeback) + Ack
o Overhead = 8/(8+72+72) = 5% (independent of #cores!)
(Diagram key for this section: green = required, red = overhead; thin arrows are 8-byte control messages, thick arrows are 72-byte data messages.)
1. Communication: (b) No Sharing, Clean
o W/o coherence: Request + Data
o W/ coherence: Request + Data + (Evict) + Ack
o Overhead = 16/(8+72) = 10-20% (independent of #cores!)
1. Communication: (c) Sharing, Read
o To memory: Request + Data
o To one other core: Request + Forward + Data + (Cleanup)
o Charge 1-2 control messages (independent of #cores!)
1. Communication: (d) Sharing, Write
o If Shared at C other cores:
o Request + {Data, C Invalidations + C Acks} + (Cleanup)
o Needed since most directory protocols send invalidations to caches that have, & sometimes do not have, copies
o Not scalable
1. Communication: Extra Invalidations
o Core 1 Read: Request + Data
o Core C Write: Request + {Data, 2 Inv + 2 Acks} + (Cleanup)
o Charge the Write for all necessary & unnecessary invalidations
o What if all invalidations are necessary? Charge reads that get data!
(Diagram: coarse tracking with one bit per core pair {1|2, 3|4, …, C-1|C}; starting from {0 0 … 0}, Core 1's read sets {1 0 … 0} and Core C's copy shows as {0 0 … 1}, so the write must invalidate both cores of a pair even when only one has a copy.)
1. Communication: No Extra Invalidations
o Core 1 Read: Request + Data + {Inv + Ack} (in future)
o Core C Write: Request + Data + (Cleanup)
o If all invalidations are necessary, coherence adds
o Bounded overhead to each miss -- independent of #cores!
(Diagram: exact tracking with one bit per core {1, 2, 3, 4, …, C-1, C}; {0 0 0 0 … 0 0} initially, {1 0 0 0 … 0 0} after Core 1's read, {0 0 0 0 … 0 1} after Core C's write.)
1. Communication Overhead
(1) Communication overhead is bounded & scalable:
(a) Without sharing, dirty
(b) Without sharing, clean
(c) Shared read miss (charge future inv + ack)
(d) Shared write miss (not charged for invs + acks)
But this depends on tracking exact sharers (next). The sketch below works through cases (a) and (b).
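A small sketch of the per-miss overhead arithmetic in cases (a) and (b), using the slides' 8-byte control and 72-byte data messages (our code; assumptions as stated in the comments):

```python
# Per-miss communication overhead, following the slides' message sizes:
# 8-byte control messages, 72-byte data messages (64B data + ~8B header).
CTRL, DATA = 8, 72

# (a) No sharing, dirty: coherence adds one Ack to Request+Data+Writeback.
dirty = CTRL / (CTRL + DATA + DATA)
# (b) No sharing, clean: coherence adds an Evict and an Ack to Request+Data;
# 10% if only the Ack is counted, 20% with an explicit Evict message too.
clean_lo = CTRL / (CTRL + DATA)
clean_hi = 2 * CTRL / (CTRL + DATA)

print(f"dirty: {dirty:.0%}, clean: {clean_lo:.0%}-{clean_hi:.0%}")
# -> dirty: 5%, clean: 10%-20% -- independent of core count.
```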
Total Communication
[Figure: total communication traffic vs. read misses per write miss (up to C), comparing exact sharer tracking (unbounded storage) against inexact tracking with a 32b coarse vector.]
How do we get the performance of "exact" w/ reasonable storage?
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
2. Storage Overhead (Small Chip)
Track up to C = #readers (cores) per LLC block.
Small #cores: a C-bit vector is acceptable
o e.g., 16 bits for 16 cores: 2 bytes / 72 bytes = 3%
2. Storage Overhead (Larger Chip): Use Hierarchy!
[Figure: each cluster of K cores has private caches joined by an intra-cluster interconnection network to a cluster cache whose blocks track that cluster's cores; an inter-cluster interconnection network joins the K cluster caches to a shared last-level cache whose tracking bits record which clusters hold a block.]
2. Storage Overhead (Larger Chip)
Medium-large #cores: use hierarchy!
o Cluster: K1 cores with an L2 cluster cache
o Chip: K2 clusters with an L3 global cache
o Enables K1*K2 cores
E.g., 16 16-core clusters:
o 256 cores (16*16)
o 3% storage overhead!!
More generally? (See the sketch below.)
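A sketch of why the overhead stays flat under hierarchy: each level tracks only its K children, so every cache pays roughly K bits per ~72-byte block regardless of total core count (our arithmetic, following the slides' block sizes):

```python
# Hierarchical tracking overhead (sketch): with K children per level,
# each cache block carries ~K tracking bits over a 64B+8B block.
def tracking_overhead(k_children, block_bytes=72):
    return (k_children / 8) / block_bytes

# A flat 16-core chip and a 256-core chip built as 16 clusters of 16 cores
# pay about the same per-block overhead at every level:
print(f"{tracking_overhead(16):.1%}")   # ~2.8% for 16 cores or 16 clusters
# A flat 256-bit vector would instead cost:
print(f"{tracking_overhead(256):.1%}")  # ~44% -- why hierarchy matters
```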
Storage Overhead for Scaling
(2) Hierarchy enables scalable storage.
[Figure: storage overhead vs. core count, using 16 clusters of 16 cores each.]
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
Criticisms & Summary
3. Enforcing Inclusion (Subtle)
Inclusion: block in a private cache -> in shared cache
+ Augment shared cache to track private-cache sharers (as assumed)
- Replace in shared cache -> replace in private caches
Make such replacements impossible?
- Requires too much shared cache associativity
- E.g., 16 cores w/ 4-way caches -> 64-way assoc (see below)
- Use recall messages instead
Make recall messages necessary & rare
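The associativity claim is worst-case arithmetic: all C private caches could map C times their associativity worth of distinct blocks onto one shared-cache set (a sketch, assuming the private and shared caches use the same set-index bits):

```python
# Worst-case shared-cache associativity needed to forbid recalls entirely
# (sketch; assumes private and shared caches use the same set indexing).
cores, private_assoc = 16, 4
needed_assoc = cores * private_assoc   # each core may pin 4 distinct
print(needed_assoc)                    # blocks in one set -> 64-way
```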
Inclusion Recall Example
Shared cache miss to new block C.
Needs to replace (victimize) block B in the shared cache.
Inclusion forces replacement of B in the private caches.
[Figure: Core 3's write of C victimizes B ({0110}, S), triggering recalls to Cores 1 & 2.]
Make All Recalls Necessary
Exact state tracking (covered earlier)
+ L1/L2 replacement messages (even clean)
= Every recall message finds a cached block.
Every recall message is necessary & occurs after a cache miss (bounded overhead).
Make Necessary Recalls Rare
Recalls are naturally rare when Shared Cache Size / Σ Private Cache Sizes > 2.
(3) Recalls made rare.
[Figure: recall rate vs. shared-to-private capacity ratio, assuming misses to random sets [Hill & Smith 1989]; the Core i7's ratio is marked.]
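Under the random-sets assumption, the chance that a shared-cache victim is also held privately shrinks with that capacity ratio. A rough illustrative model (ours, not the paper's analysis; the cache sizes below are approximate i7-class numbers):

```python
# Rough model (ours): if private-cache contents are spread uniformly over
# the shared cache's blocks, a random shared-cache victim is also cached
# privately with probability ~ (total private bytes) / (shared bytes).
def recall_fraction(shared_bytes, total_private_bytes):
    return min(1.0, total_private_bytes / shared_bytes)

# Approximate i7-class sizes: 8MB LLC vs 4 cores * (256KB L2 + 64KB L1).
print(f"{recall_fraction(8 << 20, 1.25 * (1 << 20)):.0%}")  # ~16% of evictions
# A ratio > 2 keeps this bounded; exact tracking makes even these necessary.
```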
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
4. Latency Overhead: Often None
1. None: private hit
2. "None": private miss + "direct" shared cache hit
3. "None": private miss + shared cache miss
4. BUT…
4. Latency Overhead: Some
4. 1.5-2X: private miss + shared cache hit with indirection(s)
How bad?
39
4. Latency Overhead -- Indirection 4. 1.5-2X: private miss + shared cache hit with indirection(s) interconnect + cache + interconnect + cache + interconnect --------------------------------------------------------------------------------------------- interconnect + cache + interconnect Acceptable today Relative latency similar w/ more cores/hierarchy Vs. magically having data at shared cache (4) Latency overhead bounded & scalable 39
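Plugging illustrative latencies into that ratio shows where the 1.5-2X range comes from (our numbers, purely for intuition):

```python
# Illustrative only: these cycle counts are invented to show the ratio.
interconnect, cache = 20, 30   # hypothetical hop / lookup latencies
indirect = 3 * interconnect + 2 * cache  # requester -> dir -> owner -> back
direct   = 2 * interconnect + 1 * cache  # requester -> shared cache -> back
print(indirect / direct)       # ~1.71 -- within the slides' 1.5-2X range
```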
5. Energy Overhead
Dynamic: small
o Extra message energy: traffic increase small/bounded
o Extra state lookup: small relative to a cache block lookup
Static: also small
o Extra state: state increase small/bounded
Little effect on energy-intensive cores, cache data arrays, off-chip DRAM, secondary storage, …
(5) Energy overhead bounded & scalable
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle): apply analysis to caches used by AMD
Criticisms & Summary
Review: Inclusive Shared Cache
Inclusive shared cache: block in a private cache -> in shared cache.
Blocks must be cached redundantly.
[Figure: shared-cache block with tracking bits (~1 bit per core), state (~2 bits), tag (~64 bits), block data (~512 bits).]
Non-Inclusive Shared Cache
1. Non-inclusive shared cache (state + tag + block data, ~2 + ~64 + ~512 bits): any size or associativity; avoids redundant caching; allows victim caching.
2. Inclusive directory (probe filter) (tracking bits + state + tag, ~1 bit per core + ~2 + ~64 bits): dataless; ensures coherence, but duplicates tags.
Non-Inclusive Shared Cache
Non-inclusive shared cache: data block + tag (any configuration).
Inclusive directory: tag (again) + state. Inclusive directory == coherence state overhead.
WITH TWO LEVELS:
o Directory size proportional to sum of private cache sizes
o 64b/(48b+512b) * 2 (for rare recalls) = 22% * Σ L1 sizes
Coherence overhead higher than w/ inclusion:
L2 / ΣL1s:  1    2     4     8
Overhead:   11%  7.6%  4.6%  2.5%
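A sketch of the two-level sizing arithmetic behind that table (ours, following the slide's 48b-tag + 512b-data entries and the 2x factor for keeping recalls rare):

```python
# Two-level non-inclusive overhead sketch: the dataless directory must
# cover ~2x the private (L1) capacity; each entry is ~64b of tag+state
# over a 48b-tag + 512b-data cached block.
dir_cost = 64 / (48 + 512) * 2          # ~22% of total L1 bytes

for ratio in (1, 2, 4, 8):              # L2 size / sum of L1 sizes
    # Overhead relative to all cache storage (L1s + L2):
    overhead = dir_cost / (1 + ratio)
    print(f"L2/SumL1 = {ratio}: {overhead:.1%}")
# -> 11.4%, 7.6%, 4.6%, 2.5% -- matching the table above.
```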
Non-Inclusive Shared Caches
WITH THREE LEVELS:
Cluster has an L2 cache & a cluster directory
o Cluster directory points to cores w/ an L1 block (as before)
o (1) Size = 22% * ΣL1 sizes
Chip has an L3 cache & a global directory; the global directory points to clusters w/ a block in
o (2) a cluster directory, for size 22% * ΣL1s, +
o (3) a cluster L2 cache, for size 22% * ΣL2s
Hierarchical overhead higher than w/ inclusion:
L3 / ΣL2s = L2 / ΣL1s:  1    2    4     8
Overhead (1)+(2)+(3):   23%  13%  6.5%  3.1%
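The same sketch extends to three levels: normalizing ΣL1 = 1 gives ΣL2 = r and L3 = r², and terms (1)+(2)+(3) together cost 22% * (1 + 1 + r). This is our reading of the slide's decomposition:

```python
# Three-level extension of the sketch above (our reading of terms (1)-(3)):
# cluster directories cover SumL1; the global directory covers the cluster
# directories (~SumL1 again) plus the cluster L2s (~SumL2).
dir_cost = 64 / (48 + 512) * 2           # ~22%, as before

for r in (1, 2, 4, 8):                   # L3/SumL2 = L2/SumL1 = r
    l1, l2, l3 = 1.0, r, r * r           # normalized capacities
    directories = dir_cost * (l1 + l1 + l2)
    overhead = directories / (l1 + l2 + l3)
    print(f"r = {r}: {overhead:.1%}")
# -> 22.9%, 13.1%, 6.5%, 3.1% -- matching the table above.
```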
Outline
Motivation & Coherence Background
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages (subtle)
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches (subtle)
Criticisms & Summary
Some Criticisms
(1) Where are the workload-driven evaluations?
o We focused on robust analysis of first-order effects
(2) What about non-coherent approaches?
o We showed that compatible coherence scales
(3) What about protocol complexity?
o We have such protocols today (& ideas for better ones)
(4) What about multi-socket systems?
o Apply non-inclusive approaches
(5) What about software scalability?
o Hard SW work remains, but it need not re-implement coherence
Executive Summary
Today's chips provide shared memory w/ HW coherence as low-level support for OS & application SW. As #cores per chip scales:
o Some argue HW coherence must go due to growing overheads
o We argue it stays, by managing overheads
We develop a scalable on-chip coherence proof-of-concept:
o Inclusive caches first
o Exact tracking of sharers & replacements (key to analysis)
o Larger systems need to use hierarchy (clusters)
o Overheads similar to today's
Compatibility of on-chip HW coherence is here to stay. Let's spend programmer sanity on parallelism, not lost compatibility!
Coherence NOT this Awkward [figure slide]
Backup Slides (some old)
Outline
Baseline Multicore Chip & Coherence
Scalability Challenges
1. Communication: Extra bookkeeping messages (longer section)
2. Storage: Extra bookkeeping storage
3. Enforcing Inclusion: Extra recall messages
4. Latency: Indirection on some requests
5. Energy: Dynamic & static overhead
Extension to Non-Inclusive Shared Caches
Criticisms & Summary
Coherence Example (SAVE)
Block A in no private caches: state Invalid (I).
Block B in no private caches: state Invalid (I).
[Figure: Cores 0-3 with private caches and shared cache Banks 0-3 holding the entries for A and B.]
1. Communication Overhead WITHOUT SHARING
o 8-byte control messages: Request, Evict, Ack
o 72-byte messages for 64-byte data
Dirty blocks:
o W/o coherence: Request + Data + Data (writeback)
o W/ coherence: Request + Data + Data (writeback) + Ack
o Overhead = 8/(8+72+72) = 5%
Clean blocks:
o W/o coherence: Request + Data
o W/ coherence: Request + Data + (Evict) + Ack
o Overhead = 16/(8+72) = 10-20%
Overhead independent of #cores
1. Communication Overhead WITH SHARING
Read miss:
o To memory: Request + Data
o To one other core: Request + Forward + Data + (Cleanup)
o Charge (at most) 1 Invalidation + 1 Ack
Write miss:
o To one other core: Request + Forward + Data + (Cleanup)
o To C other cores: as above + C Invalidations + C Acks
o If every invalidation is useful, charge the Read miss, not the Write miss
(1) Communication overhead bounded & scalable
But depends on tracking exact sharers (next)
2. Storage Overhead
Track up to C = #readers (cores) per LLC block.
Small #cores: a C-bit vector is acceptable
o e.g., 16 bits for 16 cores: 2 bytes / 72 bytes = 3%
Medium-large #cores: use hierarchy!
o Cluster: K1 cores with an L2 cluster cache
o Chip: K2 clusters with an L3 global cache
o Enables K1*K2 cores (picture next)
4. Latency Overhead
Added coherence latency (ignoring hierarchy):
1. None: private hit
2. "None": private miss + shared cache miss
3. "None": private miss + "direct" shared cache hit
4. 1.5-2X: private miss + shared cache hit with indirection(s):
(interconnect + cache + interconnect + cache + interconnect) / (interconnect + cache + interconnect)
Acceptable today; not significantly changed by scale or hierarchy.
(4) Latency overhead bounded & scalable
1. Communication: (d) Sharing, Write
o If Shared at C other cores:
o Request + {Data, C Invalidations + C Acks} + (Cleanup)
o If every invalidation is useful, charge the Read miss, not the Write miss
o Overhead independent of #cores for all cases!
Why On-Chip Cache Coherence is Here to Stay
Milo M. K. Martin, Univ. of Pennsylvania
Mark D. Hill, Univ. of Wisconsin
Daniel J. Sorin, Duke Univ.
October 2012 @ Wisconsin
Appears in [Communications of the ACM, July 2012]
Study cache coherence performance WITHOUT trace-driven or execution-driven simulation!
A Future for HW Coherence?
Academics criticize HW coherence:
o "Directory… coherence… extremely complex & inefficient…. Directory… incurring significant storage and invalidation traffic overhead." – Choi et al. [DeNovo]
o "A software-managed coherence protocol… avoids… directories and duplicate tags, & implementing & verifying… less traffic…" – Kelm et al. [Cohesion]
Industry experiments with avoiding HW coherence:
o Intel 48-Core IA-32 Message-Passing Processor: "…SW protocols… to eliminate the communication & HW overhead"
o IBM Cell processor: "…the greatest opportunities for increased application performance, is the existence of the local store memory and the fact that software must manage this memory"