1
Performance, Area and Bandwidth Implications on Large-Scale CMP Cache Design Li Zhao, Ravi Iyer, Srihari Makineni, Jaideep Moses, Ramesh Illikkal, Don Newell Intel Corporation
2
Outline Motivation Overview of LCMP Constraint-aware Analysis Methodology Experiment Results Area and Bandwidth Implications Performance Evaluation Summary
3
Motivation
CMP architectures have been widely adopted
SCMP: a few large out-of-order cores (e.g., the Intel dual-core Xeon processor)
LCMP: many small in-order cores for high throughput (e.g., Sun Niagara, Azul)
Questions on the cache/memory hierarchy:
How do we prune the cache design space for LCMP architectures? What methodology needs to be put in place?
How should the cache be sized and shared at each level of the hierarchy?
How much memory and interconnect bandwidth is required for scalable performance?
The goal of this paper is to accomplish a first level of analysis that narrows the design space
4
Outline Motivation Overview of LCMP Constraint-aware Analysis Methodology Experiment Results Area and Bandwidth Implications Performance Evaluation Summary
5
Overview of LCMP
16 or 32 lightweight cores on-die
[Block diagram: cores with L1 and L2 caches and a shared L3, connected by the on-die interconnect to the memory interface and IO interface; the LCMP CPU attaches to DRAM and an IO bridge]
6
Outline Motivation Overview of LCMP Constraint-aware Analysis Methodology Experiment Results Area and Bandwidth Implications Performance Evaluation Summary
7
Cache Design Considerations
Die area constraints: only a fraction of the die (40 to 60%) may be available for cache
On-die and off-die bandwidth: the on-die interconnect carries traffic between the levels of the cache hierarchy; off-die bandwidth is the memory bandwidth
Power consumption
Overall performance: indicates how effectively the cache design supports many simultaneous threads of execution
8
Constraint-Aware Analysis Methodology
Area constraints: prune the design space by the area constraints; estimate the area required for the L2 cache, then apply the overall area constraint to the cache hierarchy
Bandwidth constraints: further prune the options that survive the area constraint by applying on-die and off-die bandwidth constraints; estimate the number of requests generated by the caches at each level, which depends on core performance and cache performance for a given workload
Overall performance: compare the performance of the remaining options and determine the top two or three design choices
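The three-step flow above is essentially filter-by-area, filter-by-bandwidth, then rank-by-performance. The Python sketch below illustrates that flow under assumptions: the estimator functions (area_of, bandwidth_of, perf_of) are hypothetical stand-ins for the CACTI-based area model, the bandwidth model, and the platform simulator used in the paper.

```python
# Illustrative sketch of the constraint-aware pruning flow (not the paper's code).
# area_of(cfg)      -> estimated cache area in mm^2         (hypothetical stand-in)
# bandwidth_of(cfg) -> (on-die GB/s, off-die GB/s) demand   (hypothetical stand-in)
# perf_of(cfg)      -> simulated throughput                 (hypothetical stand-in)

def prune_design_space(options, area_of, bandwidth_of, perf_of,
                       area_budget_mm2, ondie_limit_gbs, offdie_limit_gbs, keep=3):
    # Step 1: drop configurations whose caches exceed the die-area budget.
    survivors = [c for c in options if area_of(c) <= area_budget_mm2]
    # Step 2: drop configurations whose bandwidth demand exceeds the platform limits.
    survivors = [c for c in survivors
                 if bandwidth_of(c)[0] <= ondie_limit_gbs
                 and bandwidth_of(c)[1] <= offdie_limit_gbs]
    # Step 3: rank the remaining options by simulated performance; keep the top few.
    return sorted(survivors, key=perf_of, reverse=True)[:keep]
```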
9
Outline Motivation Overview of LCMP Constraint-aware Analysis Methodology Experiment Results Area and Bandwidth Implications Performance Evaluation Summary
10
Experimental Setup
Platform simulator: core model, cache hierarchy, interconnect model, memory model
Area estimation tool: CACTI 3.2
Workloads and traces: OLTP (TPC-C), SAP (SAP SD 2-tier workload), JAVA (SPECjbb2005)
Baseline configuration: 32 cores (4 threads/core), core CPI = 6; cores grouped into nodes (1 to 4 cores/node) with an L2 per node (128K to 4M)
11
Area Constraints
Look for options that support 3 levels of cache
Assume the total die area is 400 mm²
Two cache-area constraints: 50% of the die (200 mm²) and 75% of the die (300 mm²)
Inclusive cache: L3 >= 2 x L2
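A minimal sketch of this area screen, assuming a fixed mm² per MB of cache as the density figure (an illustrative assumption; the paper derives cache area from CACTI 3.2):

```python
# Area screen for candidate (L2 per node, node count, L3) options.
# MM2_PER_MB is an assumed SRAM density (tags and overhead included), not a CACTI value.
MM2_PER_MB = 10.0

def total_cache_area_mm2(l2_kb_per_node, nodes, l3_mb):
    l2_mb = l2_kb_per_node * nodes / 1024.0
    return (l2_mb + l3_mb) * MM2_PER_MB

def passes_area_screen(l2_kb_per_node, nodes, l3_mb, budget_mm2):
    l2_mb = l2_kb_per_node * nodes / 1024.0
    # Inclusive hierarchy: the L3 must be at least twice the aggregate L2 capacity.
    if l3_mb < 2 * l2_mb:
        return False
    return total_cache_area_mm2(l2_kb_per_node, nodes, l3_mb) <= budget_mm2

# Example: 8 nodes x 512K L2 with a 16M L3 against the 200 mm^2 (50% of die) budget.
print(passes_area_screen(512, 8, 16, budget_mm2=200))
```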
12
Sharing Impact
MPI (misses per instruction) decreases as we increase the sharing degree
A 512K L2 seems to be a sweet spot
[Figure: TPC-C MPI versus L2 size for nodes of one to four cores sharing an L2]
13
Bandwidth Constraints
4 cores/node, 8 nodes
On-die bandwidth demand is around 180 GB/s with an infinite L3 and drops significantly with a 32M L3 cache
Off-die memory bandwidth demand is between 40 and 50 GB/s and decreases as the L3 cache size increases
[Charts: on-die bandwidth and off-die bandwidth versus L3 size]
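A rough way to see where such bandwidth-demand figures come from is to multiply the aggregate miss rate by the cache-line size. The sketch below is a back-of-the-envelope model only; the frequency, CPI, miss-per-instruction and line-size values are assumptions chosen for illustration, not the paper's simulator inputs.

```python
# Back-of-the-envelope bandwidth-demand model (all inputs are illustrative assumptions).
def bw_demand_gbs(threads, freq_ghz, cpi, misses_per_instr, line_bytes=64):
    instrs_per_sec = threads * freq_ghz * 1e9 / cpi   # aggregate instruction rate
    return instrs_per_sec * misses_per_instr * line_bytes / 1e9

# On-die traffic is driven by L2 misses, off-die traffic by L3 misses.
threads = 32 * 4   # 32 cores x 4 threads/core
ondie  = bw_demand_gbs(threads, freq_ghz=2.0, cpi=6, misses_per_instr=0.02)
offdie = bw_demand_gbs(threads, freq_ghz=2.0, cpi=6, misses_per_instr=0.005)
print(f"on-die ~{ondie:.0f} GB/s, off-die ~{offdie:.0f} GB/s")
```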
14
Cache Options Summary
Node: 1 to 4 cores
L2 size per core: around 128K to 256K seems viable for a 32-core LCMP
L3 size: 8M to about 20M, depending on the configuration, can be considered

Cores/node   # of nodes   L2 cache/node   L3 cache size
1            32           128K            ~12M
2            16           256K - 512K     8M - 16M
4            8            512K - 1M       10M - 18M
15
Performance Evaluation (TPC-C)
On-die bandwidth is 512 GB/s; maximum sustainable memory bandwidth is 64 GB/s
Performance: the configuration with 4 cores/node, 1M L2 and 32M L3 is the best
Performance per unit area: the configuration with 4 cores/node, 512K L2 and 8M L3 is the best
Performance³ per unit area: 4 cores/node, 512K to 1M L2, 8M to 16M L3
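The three figures of merit above (raw performance, performance/area, and performance³/area) can be computed directly once a configuration's throughput and cache area are known. The snippet below only sketches those metrics; the performance and area numbers in the example are hypothetical placeholders, not results from the paper.

```python
# Figures of merit used to rank configurations (example values are hypothetical).
def perf_per_area(perf, area_mm2):
    return perf / area_mm2

def perf_cubed_per_area(perf, area_mm2):
    # perf^3/area weights throughput far more heavily than area cost, so it
    # tolerates larger caches as long as they keep delivering performance.
    return perf ** 3 / area_mm2

for name, perf, area in [("larger-cache config", 1.00, 300.0),
                         ("smaller-cache config", 0.90, 150.0)]:
    print(name, round(perf_per_area(perf, area), 4),
          round(perf_cubed_per_area(perf, area), 4))
```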
16
Performance Evaluation (SAP, SPECjbb)
17
Implications and Inferences
Design a 3-level cache hierarchy
Each node consists of four cores with 512K to 1M of L2 cache
The L3 cache size is recommended to be a minimum of 16M
The platform should support at least 64 GB/s of memory bandwidth and 512 GB/s of interconnect bandwidth
18
Summary
Performed the first study of the performance, area and bandwidth implications of LCMP cache design
Introduced a constraint-aware analysis methodology to explore LCMP cache design options
Applied this methodology to a 32-core LCMP architecture and quickly narrowed the design space to a small subset of viable options
Conducted an in-depth performance/area evaluation of these options and summarized a set of recommendations for architecting efficient LCMP platforms