
1 ECE8833 Polymorphous and Many-Core Computer Architecture
Prof. Hsien-Hsin S. Lee, School of Electrical and Computer Engineering
Lecture 5: Non-Uniform Cache Architecture for CMP

2 CMP Memory Hierarchy
Continuing device scaling leads to
–Deeper memory hierarchy (L2, L3, etc.)
–Growing cache capacity: 6MB in AMD's Phenom quad-core, 8MB in Intel Core i7, 24MB L3 in Itanium 2
Global wire delay
–Routing dominates access time
Design for the worst case
–Compromise for the slowest access
–Penalizes all memory accesses
–Undesirable

3 Evolution of Cache Access Time
Facts
–Large shared on-die L2
–Wire delay dominating on-die cache access
–1MB @ 180nm (1999): 3 cycles
–4MB @ 90nm (2004): 11 cycles
–16MB @ 50nm (2010): 24 cycles

4 Multi-banked L2 Cache (2MB @ 130nm)
–Bank size = 128KB
–Bank access time = 3 cycles
–Interconnect delay = 8 cycles
–Access time = 11 cycles

5 Multi-banked L2 Cache (16MB @ 50nm)
–Bank size = 64KB
–Bank access time = 3 cycles
–Interconnect delay = 44 cycles
–Access time = 47 cycles
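The access times on these two slides are simply the fixed bank access time plus the interconnect delay. A tiny Python check of that arithmetic, using only the numbers from the slides:

```python
# Access time = bank access time + interconnect delay
# (numbers taken from the two multi-banked L2 slides above).
BANK_ACCESS = 3  # cycles
for config, interconnect in [("2MB @ 130nm, 128KB banks", 8),
                             ("16MB @ 50nm, 64KB banks", 44)]:
    print(f"{config}: {BANK_ACCESS + interconnect} cycles")
# 2MB @ 130nm, 128KB banks: 11 cycles
# 16MB @ 50nm, 64KB banks: 47 cycles
```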

6 NUCA: Non-Uniform Cache Architecture [Kim et al., ASPLOS-X, 2002]
Partition a large cache into banks
Non-uniform latencies for different banks
Design space exploration
–Mapping: How many banks (i.e., what is the granularity)? How are lines mapped to banks?
–Search: Strategy for searching the set of possible locations of a line
–Movement: Should a line always stay in the same bank? How does a line migrate across banks over its lifetime?

7 Cache Hierarchy Taxonomy (16MB @ 50nm) [Kim et al., ASPLOS-X 2002]
–UCA: 1 bank, 41-cycle contentionless latency, 255-cycle average access time
–ML-UCA: 1 bank per level (L2 backed by L3), 11/41-cycle access times
–S-NUCA-1: 32 banks, 17-41-cycle contentionless latency range, 34-cycle average
–S-NUCA-2: 32 banks, 9-32-cycle contentionless latency range, 24-cycle average
–D-NUCA: 256 banks, 4-47-cycle contentionless latency range, 18-cycle average
(Contentionless latencies are from CACTI; average access times are from simulation that models bank and channel conflicts.)

8 Static NUCA-1 Using Private Channels
Each bank has its own distinct access latency
Data location is statically pre-determined by the address: low-order bits are used as the bank index
Upside
–More banks avoid one bulky, slow access
–Accesses to different banks can proceed in parallel
Overhead
–Decoders
–Wire-dominated: a separate set of private wires is required for every bank
Average access latency = 34.2 cycles
Wire area overhead = 20.9% → an issue
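As a concrete illustration of "use low-order bits for bank index", the sketch below maps a physical address to a bank number. The cache-line size is an assumption chosen for the example; the 32-bank count matches the S-NUCA configurations in the taxonomy slide.

```python
LINE_BYTES = 64   # assumed cache-line size
NUM_BANKS = 32    # S-NUCA-1/-2 at 16MB/50nm use 32 banks

def snuca_bank(addr: int) -> int:
    """Static NUCA mapping: low-order bits of the line address pick the bank."""
    line_addr = addr // LINE_BYTES      # drop the byte-offset bits
    return line_addr % NUM_BANKS        # low-order line-address bits = bank index

print(snuca_bank(0x12345))   # every address maps to exactly one fixed bank
```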

9 Static NUCA-2 Using Switched Channels
Relieves the wire congestion of Static NUCA-1 by using a 2-D switched network
–Wormhole-routed flow control
–Each switch buffers 128-bit packets
Average access latency = 24.2 cycles
–On average, 0.8 cycle of "bank" contention + 0.7 cycle of "link" contention in the network
Wire area overhead = 5.9%
(Figure: a bank with its tag array, predecoder, wordline drivers and decoders, data bus, and per-bank switch; latency range 9-32 cycles)
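To make the latency decomposition concrete, here is a hedged back-of-the-envelope model: access latency is the bank access time plus a per-hop network delay, with the slide's measured average contention (0.8-cycle bank + 0.7-cycle link) added on top. The per-hop delay and hop count are illustrative assumptions, not numbers from the paper.

```python
def snuca2_latency(hops: int, bank_access: int = 3, per_hop_cycles: float = 1.0,
                   avg_contention: float = 0.8 + 0.7) -> float:
    """Rough S-NUCA-2 access-latency model (per_hop_cycles is an assumption)."""
    return bank_access + hops * per_hop_cycles + avg_contention

print(snuca2_latency(hops=5))   # e.g., a bank five switch hops away
```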

10 Dynamic NUCA
Data can migrate dynamically: frequently used cache lines are promoted closer to the CPU
Data management (the same design space as before)
–Mapping: How many banks (i.e., what is the granularity)? How are lines mapped to banks?
–Search: Strategy for searching the set of possible locations of a line
–Movement: Should a line always stay in the same bank? How does a line migrate across banks over its lifetime?
(D-NUCA: 256 banks, 4-47-cycle latency range, 18-cycle average)

11 Dynamic NUCA: Simple Mapping
All 4 ways of a line's bank set need to be searched
Non-uniform access times across bank sets: farther bank sets → longer access
(Figure: 8 bank sets as columns leading away from the memory controller; each set spans 4 bank "ways", with way 0 closest to the controller)
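A minimal sketch of the simple mapping described above: low-order bits of the line address select one of the 8 bank sets, and the line may then reside in any of that set's 4 bank "ways", all of which may need to be probed. The bank-set and way counts follow the figure; the address arithmetic is an assumption for illustration.

```python
NUM_BANK_SETS = 8   # bank sets (columns), as in the figure
WAYS_PER_SET = 4    # bank "ways" per set; way 0 is closest to the controller

def candidate_banks(line_addr: int):
    """All banks that may hold this line under simple mapping."""
    bank_set = line_addr % NUM_BANK_SETS
    return [(bank_set, way) for way in range(WAYS_PER_SET)]

print(candidate_banks(0x2A))   # four candidate (set, way) banks to search
```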

12 Dynamic NUCA: Fair Mapping (proposed, not studied in the paper)
Average access time is equal across all bank sets
Complex routing, likely more contention
(Figure: 8 bank sets mapped around the memory controller)

13 Dynamic NUCA: Shared Mapping
The closest banks are shared among multiple bank sets
Some banks have slightly higher associativity, which offsets the increased average access latency due to distance
(Figure: 8 bank sets sharing the banks nearest the memory controller)

14 Locating a NUCA Line
Incremental search
–Probe banks from the closest to the farthest
(Limited, partitioned) multicast search
–Search all (or a partition of) the candidate banks in parallel
–Return time depends on the routing distance
Smart search
–Use partial tag comparison [Kessler '89] (also used in the P6)
–Keep the partial tag array in the cache controller
–Similar modern techniques: Bloom filters
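Below is a hedged sketch of "smart search": the controller keeps a small partial-tag copy per bank and probes only the banks whose partial tag matches, with the full tag comparison done inside each probed bank. The partial-tag width and data structures are assumptions for illustration, not the paper's implementation.

```python
PARTIAL_BITS = 6    # assumed partial-tag width

def ptag(tag: int) -> int:
    return tag & ((1 << PARTIAL_BITS) - 1)

def smart_search(tag: int, partial_tags: dict, banks: dict):
    """partial_tags[bank] holds the partial tags of lines resident in that bank;
    banks[bank] holds the full tags actually resident there."""
    candidates = [b for b in partial_tags if ptag(tag) in partial_tags[b]]
    for b in sorted(candidates):          # probe only the matching banks
        if tag in banks[b]:               # full tag comparison in the bank
            return b
    return None                           # miss (or a false partial-tag match)

banks = {0: {0x1ABC}, 1: {0x2ABC}}        # both tags share the same partial tag
partial_tags = {b: {ptag(t) for t in lines} for b, lines in banks.items()}
print(smart_search(0x2ABC, partial_tags, banks))   # both banks probed, hit in 1
```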

15 D-NUCA: Dynamic Movement of Cache Lines — Placement upon a Hit
LRU ordering
–A conventional implementation only adjusts LRU bits
–NUCA requires physical movement to get the latency benefit (n copy operations)
Generational promotion
–Only swap the hit line with the line in the neighboring bank one step closer to the controller
–A line receives more "latency reward" when it is hit contiguously
(Figure: old vs. new state of a bank set before and after a hit, for both policies)
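A minimal sketch of generational promotion for one bank set. The set is modeled as a list ordered by distance from the controller (index 0 = closest); a hit swaps the line one position closer, and a miss simply seeds the line into the farthest bank (insertion policies are the topic of the next slide). This is an illustrative model, not the paper's hardware.

```python
def access(ways: list, tag) -> bool:
    """One D-NUCA bank set; ways[0] is the bank closest to the controller."""
    if tag in ways:
        i = ways.index(tag)
        if i > 0:                                        # generational promotion:
            ways[i - 1], ways[i] = ways[i], ways[i - 1]  # swap one bank closer
        return True                                      # hit
    ways[-1] = tag                                       # miss: place in farthest bank
    return False

ways = [None, None, None, None]
for _ in range(3):
    access(ways, "A")
print(ways)   # "A" has been promoted two banks closer: [None, 'A', None, None]
```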

16 D-NUCA: Dynamic Movement of Cache Lines — Upon a Miss
Incoming line insertion
–Into the most distant bank (assist-cache concept), or
–Into the MRU position (the closest bank)
Victim eviction
–Zero copy: the displaced victim simply leaves the cache
–One copy: the displaced victim is moved once, to another (more distant) bank
(Figure: controller, the inserted line, and the victim for each insertion/eviction combination)
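The two victim options can be sketched on the same bank-set model used above. Exactly where a "one copy" victim lands is simplified here to "one bank farther out"; treat this as one plausible reading of the slide, not the paper's exact policy.

```python
def displace(ways: list, idx: int, zero_copy: bool = True) -> None:
    """Make room at ways[idx] for an incoming or promoted line."""
    victim = ways[idx]
    if victim is None or zero_copy:
        return                          # zero copy: the victim simply leaves
    if idx + 1 < len(ways):
        ways[idx + 1] = victim          # one copy: move the victim farther out
    # if idx is already the farthest bank, the victim is evicted anyway

ways = ["A", "B", "C", "D"]
displace(ways, 1, zero_copy=False)      # "B" moves out to index 2, evicting "C"
ways[1] = "NEW"
print(ways)                             # ['A', 'NEW', 'B', 'D']
```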

17 Sharing a NUCA Cache in a CMP
Sharing degree (SD) of N: the number of processor cores that share one cache
Low SD
–Smaller private partitions
–Good hit latency, poor hit rate
–More discrete L2 caches, so L2 coherence becomes expensive (e.g., a centralized L2 tag directory is needed)
High SD
–Good hit rate, but worse hit latency
–More efficient inter-core communication
–More expensive L1 coherence
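A one-line illustration of what sharing degree means for a 16-core chip (the core count comes from the next slide's substrate):

```python
CORES = 16
for sd in (1, 2, 4, 8, 16):
    print(f"SD={sd:2d}: {CORES // sd} separate L2 caches, each shared by {sd} core(s)")
```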

18 16-Core CMP Substrate and SD [Huh et al., ICS '05]
Low SD (e.g., 1) needs either snooping or a central L2 tag directory for coherence
High SD (e.g., 16) also needs a directory to indicate which cores' L1s hold a copy (the approach used in the Piranha CMP)

19 Trade-offs of Cache Sharing Among Cores
Upside
–Keeps a single copy of shared data
–Uses area more efficiently
–Faster inter-core communication: no L2 coherence fabric needed
Downside
–Larger structure, slower access
–Longer wire delay
–More congestion on the shared interconnect

20 Flexible Cache Mapping [Huh et al., ICS '05]
Static mapping
–L2 access latency is fixed at line-placement time
Dynamic mapping
–D-NUCA idea: a line can migrate across multiple banks
–A line moves closer to the core that accesses it frequently
–Lookup can be expensive: all partial tags are searched first

21 Flexible Cache Sharing
Support multiple sharing degrees for different classes of blocks (per-line sharing degree)
Classify lines as
–Private (assign a smaller SD)
–Shared (assign a larger SD)
The study found a 6-7% improvement over the best uniform SD
–SD = 1 or 2 for private data
–SD = 16 for shared data
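A hedged sketch of the per-line sharing-degree idea: lines observed to be private get a small SD, shared lines get the full SD. The sharer-counting mechanism and the exact threshold are assumptions for illustration; the slide only gives the resulting assignments (SD = 1 or 2 vs. SD = 16).

```python
def assign_sharing_degree(num_sharers: int, total_cores: int = 16) -> int:
    """Per-line SD: private lines -> small partition, shared lines -> full chip."""
    return 2 if num_sharers <= 1 else total_cores   # SD=2 private, SD=16 shared

print(assign_sharing_degree(1), assign_sharing_degree(5))   # 2 16
```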

22 Enhancing Cache/Memory Performance
Cache partitioning
–Explicitly manage cache allocation among processes; each process benefits differently from additional cache space
–Similar to main-memory partitioning [Stone '92] in the good old days
Memory-aware scheduling
–Choose a set of simultaneously running processes that minimizes cache contention
–Symbiotic scheduling for SMT by the OS: sample and collect information (performance counters) about possible schedules, then predict the best schedule (e.g., based on resource contention); complexity is high with many processes
–Admission control for gang scheduling, based on a job's footprint (total memory usage)
Slide adapted from Ed Suh's HPCA '02 presentation
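As a concrete example of explicitly managed allocation, the sketch below partitions the ways of a set-associative cache between two processes. The associativity, the specific allocation, and the rule that a process may only fill its own ways are assumptions chosen for illustration; the allocation itself would come from a partitioning policy such as the ones discussed in Suh's work.

```python
WAYS = 8                                 # assumed associativity
allocation = {"proc_A": 5, "proc_B": 3}  # assumed output of a partitioning policy
assert sum(allocation.values()) == WAYS

def may_fill(proc: str, way: int) -> bool:
    """proc_A may fill ways [0, 5); proc_B may fill ways [5, 8)."""
    start = 0 if proc == "proc_A" else allocation["proc_A"]
    return start <= way < start + allocation[proc]

print([w for w in range(WAYS) if may_fill("proc_B", w)])   # [5, 6, 7]
```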

23 Victim Replication

24 Today's Chip Multiprocessors (Shared L2)
Layout: "dance hall"
–Per processing node: core + L1 cache
–Shared L2 cache behind an intra-chip switch
Small L1 caches
–Fast access
Large shared L2 cache
–Good hit rate
–Slower access latency
(Figure: cores with private L1s connected through an intra-chip switch to a shared L2 cache)
Slide adapted from presentation by Zhang and Asanovic, ISCA '05

25 Today's Chip Multiprocessors (Shared L2)
Layout: "dance hall"
–Per processing node: core + L1 cache
–Shared L2 cache
Alternative: a large L2 cache divided into slices to minimize latency and power, i.e., NUCA
Challenge
–Minimize the average access latency
–Goal: average memory latency == best-case latency
(Figure: cores with private L1s connected through an intra-chip switch to per-node L2 slices)
Slide adapted from presentation by Zhang and Asanovic, ISCA '05

26 Dynamic NUCA Issues
D-NUCA does not work well for CMPs: the "unique" copy of a line cannot be close to all of its sharers
Behavior
–Over time, shared data migrates to a location "equidistant" from all sharers [Beckmann & Wood, MICRO-36]
(Figure: dance-hall CMP layout)
Slide adapted from presentation by Zhang and Asanovic, ISCA '05

27 Tiled CMP with a Directory-Based Protocol
Tiled CMPs for scalability
–Minimal redesign effort
–Use a directory-based protocol for scalability
Manage the L2s to minimize the effective access latency
–Keep data close to the requestors
–Keep data on-chip
Two baseline L2 cache designs
–Each tile has its own private L2
–All tiles share a single distributed L2
(Figure: a grid of 16 tiles, each with a core, L1, switch, L2 slice data, and L2 slice tags)
Slide adapted from presentation by Zhang and Asanovic, ISCA '05

28 "Private L2" Design Keeps Low Hit Latency
The local L2 slice is used as a private L2 cache for the tile
–Shared data is "duplicated" in the L2 of each sharer
–Coherence must be kept among all sharers at the L2 level
–Similar to DSM
On an L2 miss, either:
–The data is not on-chip, or
–The data is available in the private L2 cache of another tile
(Figure: two tiles, sharer i and sharer j, each with core, L1, private L2 data, directory, and L2 tags)
Slide adapted from presentation by Zhang and Asanovic, ISCA '05

29 "Private L2" Design Keeps Low Hit Latency
The local L2 slice is used as a private L2 cache for the tile
–Shared data is "duplicated" in the L2 of each sharer
–Coherence must be kept among all sharers at the L2 level
–Similar to DSM
On an L2 miss, the requestor consults the home node (statically determined by the address), and either:
–The data is not on-chip: off-chip access, or
–The data is available in the private L2 cache of another tile: cache-to-cache reply-forwarding from the owner/sharer
(Figure: requestor, home node, and owner/sharer tiles with the miss, forward, and reply messages)
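A toy sketch of the private-L2 read path described on the last two slides: the local slice is checked first, then the statically address-interleaved home directory, which either forwards the request to another tile's private L2 or falls through to an off-chip access. The dict-based structures and the interleaving function are assumptions for illustration.

```python
NUM_TILES = 16

def home_tile(line_addr: int) -> int:
    """Home node statically determined by the address (assumed interleaving)."""
    return line_addr % NUM_TILES

def private_l2_read(line_addr: int, tile: int, private_l2: dict, directory: dict) -> str:
    if line_addr in private_l2[tile]:
        return "local private-L2 hit"
    sharers = directory[home_tile(line_addr)].get(line_addr, set())
    if sharers:
        owner = next(iter(sharers))
        return f"cache-to-cache reply-forwarding from tile {owner}"
    return "off-chip access"

private_l2 = {t: set() for t in range(NUM_TILES)}
directory = {t: {} for t in range(NUM_TILES)}
private_l2[3].add(0x40)                       # tile 3 holds the line privately
directory[home_tile(0x40)][0x40] = {3}        # home directory records the sharer
print(private_l2_read(0x40, 7, private_l2, directory))   # forwarded from tile 3
```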

30 "Shared L2" Design Gives Maximum Capacity
All on-chip L2 slices form one distributed shared L2, backing all L1s
–"No duplication": each line is kept in a unique L2 location, its home node (statically determined by the address)
–Coherence must be kept among all sharers at the L1 level
On an L2 miss, either:
–The data is not in the L2: off-chip access, or
–It is a coherence miss: cache-to-cache reply-forwarding from the owner/sharer
(Figure: requestor, home node, and owner/sharer tiles with shared L2 slices)
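The corresponding shared-L2 sketch: the line lives only in its home slice, so a requestor whose home is remote always crosses the chip, even on a hit. Again, the structures and the interleaving function are illustrative assumptions.

```python
NUM_TILES = 16

def home_tile(line_addr: int) -> int:
    return line_addr % NUM_TILES         # static, address-interleaved home

def shared_l2_read(line_addr: int, tile: int, l2_slices: dict) -> str:
    home = home_tile(line_addr)
    if line_addr in l2_slices[home]:
        where = "local" if home == tile else f"remote (tile {home})"
        return f"L2 hit in {where} home slice"
    return "L2 miss: coherence miss (cache-to-cache forwarding) or off-chip access"

l2_slices = {t: set() for t in range(NUM_TILES)}
l2_slices[home_tile(0x80)].add(0x80)
print(shared_l2_read(0x80, 5, l2_slices))   # hit, but in a remote home slice
```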

31 Private vs. Shared L2 CMP
Shared L2
–Long, non-uniform L2 hit latency
–No duplication maximizes L2 capacity
Private L2
–Uniform, lower latency if the line is found in the local L2
–Duplication reduces L2 capacity

32 Private vs. Shared L2 CMP
Shared L2
–(con) Long, non-uniform L2 hit latency
–(pro) No duplication maximizes L2 capacity
Private L2
–(pro) Uniform, lower latency if the line is found in the local L2
–(con) Duplication reduces L2 capacity
Victim Replication: provides low hit latency while keeping the working set on-chip

33 Normal L1 Eviction in a Shared L2 CMP
When an L1 cache line is evicted
–Write it back to the home L2 slice if dirty
–Update the home directory
(Figure: sharer i, sharer j, and the home node)

34 Victim Replication
Replicas
–L1 victims are also stored in the local L2 slice
–They can be reused later for lower access latency
(Figure: sharer i, sharer j, and the home node; the victim is kept locally as a replica)

35 Hitting the Victim Replica
On an L1 miss, look up the local L2 slice
–A miss there follows the normal transaction to fetch the line from the home node
–A replica hit invalidates the replica
(Figure: the replica hit in sharer i's local L2 slice)
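A hedged sketch of the read path with victim replication, matching the behavior on this slide: the local L2 slice is probed for a replica before the normal shared-L2 transaction, and a replica hit refills the L1 and invalidates the replica. The set-based model is an assumption for illustration.

```python
def vr_read(addr: int, l1: set, local_replicas: set, fetch_from_home) -> str:
    if addr in l1:
        return "L1 hit"
    if addr in local_replicas:
        local_replicas.discard(addr)   # a replica hit invalidates the replica...
        l1.add(addr)                   # ...and the line moves back into the L1
        return "replica hit in the local L2 slice"
    fetch_from_home(addr)              # normal transaction to the home node
    l1.add(addr)
    return "miss in the local slice: fetched from the home node"

l1, replicas = set(), {0xC0}
print(vr_read(0xC0, l1, replicas, fetch_from_home=lambda a: None))  # replica hit
print(0xC0 in l1, 0xC0 in replicas)                                 # True False
```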

36 Replication Policy
A replica is only inserted into the local L2 slice when one of the following is found (in priority order, built up on the next slides)
(Figure: sharer i, sharer j, and the home node)

37 Replication Policy: Where to Insert?
A replica is only inserted when one of the following is found (in priority order)
–An invalid line
(Figure: sharer i, sharer j, and the home node)

38 Replication Policy: Where to Insert?
A replica is only inserted when one of the following is found (in priority order)
–An invalid line
–A global line with no sharers (i.e., a line in its home slice that no L1 is sharing)
(Figure: sharer i, sharer j, and the home node)

39 Replication Policy: Where to Insert?
A replica is only inserted when one of the following is found (in priority order)
–An invalid line
–A global line with no sharers
–An existing replica
(Figure: sharer i, sharer j, and the home node)

40 Replication Policy: Where to Insert?
A replica is only inserted when one of the following is found (in priority order)
–An invalid line
–A global line with no sharers
–An existing replica
A line is never replicated when
–A global line has remote sharers
(Figure: sharer i, sharer j, and the home node)

41 Replication Policy: Where to Insert?
A replica is only inserted when one of the following is found (in priority order)
–An invalid line
–A global line with no sharers
–An existing replica
A line is never replicated when
–A global line has remote sharers
–The victim's home tile is the local tile
(Figure: sharer i, sharer j, and the home node)
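Putting the build-up together, here is a hedged sketch of the replica-insertion decision on an L1 eviction, following the priority order above. The per-line record fields and the way the local slice is modeled are assumptions for illustration.

```python
def choose_replica_victim(local_set: list, victim_home_is_local: bool):
    """Return the index of the L2 line to overwrite with a replica, or None."""
    if victim_home_is_local:
        return None                       # the victim's home tile is this tile
    for i, line in enumerate(local_set):  # 1) an invalid line
        if not line["valid"]:
            return i
    for i, line in enumerate(local_set):  # 2) a global line with no sharers
        if line["valid"] and not line["replica"] and not line["sharers"]:
            return i
    for i, line in enumerate(local_set):  # 3) an existing replica
        if line["replica"]:
            return i
    return None                           # only shared global lines: never replicate

local_set = [
    {"valid": True, "replica": False, "sharers": {2, 5}},   # global line, shared
    {"valid": True, "replica": True,  "sharers": set()},    # an existing replica
]
print(choose_replica_victim(local_set, victim_home_is_local=False))   # -> 1
```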

42 VR Combines Global Lines and Replicas
Victim replication dynamically creates a large, local private victim cache for the local L1 cache
(Figure: the L2 slice under the private L2 design, the shared L2 design, and victim replication, where the shared slice is partly filled with L1 victims)
Slide adapted from presentation by Zhang and Asanovic, ISCA '05

43 When the Working Set Does Not Fit in the Local L2
–The capacity advantage of the shared design yields many fewer off-chip misses
–The latency advantage of the private design is offset by costly off-chip accesses
–Victim replication does even better than the shared design by creating replicas that reduce access latency
(Figure: average data access latency and access breakdown — L1 hits, local L2 hits, non-local L2 hits, off-chip misses — for the private (L2P), shared (L2S), and victim-replication (L2VR) designs)

44 Average Latencies of Different CMPs
Single-threaded applications
–L2VR excels in 11 out of 12 cases
Multi-programmed workloads
–L2P is always the best

