
Performance, Area and Bandwidth Implications on Large-Scale CMP Cache Design. Li Zhao, Ravi Iyer, Srihari Makineni, Jaideep Moses, Ramesh Illikkal, Don Newell. Intel Corporation.


1 Performance, Area and Bandwidth Implications on Large-Scale CMP Cache Design Li Zhao, Ravi Iyer, Srihari Makineni, Jaideep Moses, Ramesh Illikkal, Don Newell Intel Corporation

2 Outline  Motivation  Overview of LCMP  Constraint-Aware Analysis Methodology  Experimental Results Area and Bandwidth Implications Performance Evaluation  Summary

3 Motivation  CMP architectures have been widely adopted SCMP: a few large out-of-order cores  e.g., the Intel dual-core Xeon processor LCMP: many small in-order cores for high throughput  e.g., Sun Niagara, Azul  Questions on the cache/memory hierarchy How do we prune the cache design space for LCMP architectures? What methodology needs to be put in place? How should the cache be sized and shared at each level of the hierarchy? How much memory and interconnect bandwidth is required for scalable performance?  The goal of this paper is to accomplish a first level of analysis that narrows the design space

4 Outline  Motivation  Overview of LCMP  Constraint-Aware Analysis Methodology  Experimental Results Area and Bandwidth Implications Performance Evaluation  Summary

5 Overview of LCMP  16 or 32 lightweight cores on-die  [Diagram: each core (C) has private L1 and L2 caches; the cores connect over an on-die interconnect to a shared L3, a memory interface to DRAM, and an IO interface to an IO bridge]

6 Outline  Motivation  Overview of LCMP  Constraint-Aware Analysis Methodology  Experimental Results Area and Bandwidth Implications Performance Evaluation  Summary

7 Cache Design Considerations  Die area constraints Only a fraction of die space (40 to 60%) may be available for caches  On-die and off-die bandwidth On-die interconnect carries the communication between levels of the cache hierarchy Off-die memory bandwidth  Power consumption  Overall performance Indicates the effectiveness of the cache design in supporting many simultaneous threads of execution

8 Constraint-Aware Analysis Methodology  Area constraints Prune the design space using the area constraints Estimate the area required for L2, then apply the overall area constraint to this cache  Bandwidth constraints Further prune the options that survive the area constraints by applying the on-die and off-die bandwidth constraints Estimate the number of requests generated by the caches at each level, which depends on core performance and cache performance for a given workload  Overall performance Compare the performance of the pruned options and determine the top two or three design choices
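The area-pruning step above can be sketched in a few lines of code. This is a minimal illustration, not the paper's actual tool: the area model (a flat mm² per MB of cache) and all configuration numbers are hypothetical placeholders, whereas the paper derives real area estimates from CACTI. The inclusion requirement L3 ≥ 2×L2 is taken from the area-constraint slide.

```python
# Sketch: prune LCMP cache configurations against an area budget.
# MM2_PER_MB is an assumed, illustrative cache density (not CACTI output).

AREA_BUDGET_MM2 = 200   # e.g., 50% of a 400 mm^2 die
MM2_PER_MB = 8          # hypothetical area cost per MB of cache

def cache_area_mm2(l2_mb_per_node, nodes, l3_mb):
    """Total cache area under a simple linear area model."""
    return (l2_mb_per_node * nodes + l3_mb) * MM2_PER_MB

def prune_by_area(configs, budget=AREA_BUDGET_MM2):
    """Keep only configurations that fit the area budget and satisfy
    the inclusive-hierarchy requirement L3 >= 2 x total L2."""
    survivors = []
    for cfg in configs:
        total_l2 = cfg["l2_mb_per_node"] * cfg["nodes"]
        if cfg["l3_mb"] < 2 * total_l2:
            continue  # L3 too small for an inclusive hierarchy
        if cache_area_mm2(cfg["l2_mb_per_node"], cfg["nodes"], cfg["l3_mb"]) > budget:
            continue  # exceeds the die-area budget
        survivors.append(cfg)
    return survivors

configs = [
    {"nodes": 32, "l2_mb_per_node": 0.125, "l3_mb": 12},
    {"nodes": 16, "l2_mb_per_node": 0.5,   "l3_mb": 16},
    {"nodes": 8,  "l2_mb_per_node": 1.0,   "l3_mb": 32},  # too large: pruned
]
print([c["nodes"] for c in prune_by_area(configs)])  # → [32, 16]
```

The bandwidth stage would then apply the same filter pattern with per-level request-rate estimates in place of the area model.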

9 Outline  Motivation  Overview of LCMP  Constraint-Aware Analysis Methodology  Experimental Results Area and Bandwidth Implications Performance Evaluation  Summary

10 Experimental Setup  Platform simulator Core model Cache hierarchy Interconnect model Memory model  Area estimation tool CACTI 3.2  Workloads and traces OLTP: TPC-C SAP: SAP SD 2-tier workload JAVA: SPECjbb2005  Baseline configuration 32 cores (4 threads/core), core CPI = 6 Several nodes (1 to 4 cores/node), L2 per node (128K to 4M)

11 Area Constraints  Look for options that support 3 levels of cache  Assume a total die area of 400 mm²  Two cache-size constraints: 50% of die → 200 mm², 75% of die → 300 mm²  Inclusive cache  L3 ≥ 2×L2

12 Sharing Impact  MPI (misses per instruction) reduces when we increase the sharing degree  512K seems to be a sweet spot  [Chart: TPC-C MPI for different L2 sharing degrees]

13 Bandwidth Constraints  4 cores/node, 8 nodes  On-die BW demand is around 180 GB/s with an infinite L3, and reduces significantly with a 32M L3 cache  Off-die memory BW demand is between 40 and 50 GB/s, and reduces as we increase the L3 cache size  [Charts: on-die bandwidth and off-die bandwidth vs. L3 size]
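The off-die demand above follows from a simple relation: bandwidth ≈ aggregate instructions per second × L3 misses per instruction × bytes per cache line. A sketch of that arithmetic, where the 2 GHz clock and the L3 MPI value are illustrative assumptions (only the CPI of 6 comes from the baseline configuration):

```python
# Sketch: estimate off-die memory bandwidth demand from L3 miss traffic.
# The clock frequency and L3 MPI below are assumed for illustration.

def offdie_bw_gbs(cores, freq_ghz, cpi, l3_mpi, line_bytes=64):
    """BW = aggregate instructions/sec * L3 misses/instruction * bytes/line."""
    instr_per_sec = cores * (freq_ghz * 1e9) / cpi  # all cores combined
    return instr_per_sec * l3_mpi * line_bytes / 1e9  # in GB/s

# 32 cores, CPI = 6 (baseline), assumed 2 GHz clock and L3 MPI of 0.05:
print(round(offdie_bw_gbs(32, 2.0, 6, 0.05), 1))  # → 34.1
```

Growing the L3 lowers the MPI term, which is why demand falls with larger L3 sizes in the charts.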

14 Cache Options Summary  Nodes of 1 to 4 cores  L2 size per core Around 128K to 256K seems viable for a 32-core LCMP  L3 sizes ranging from 8M to about 20M, depending on the configuration, can be considered

Cores/node  # of nodes  L2 cache/node  L3 cache size
1           32          128K           ~12M
2           16          256K – 512K    8M – 16M
4           8           512K – 1M      10M – 18M

15 Performance Evaluation (TPCC)  On-die BW is 512 GB/s, max sustainable memory BW is 64 GB/s  Performance: the configuration (4 cores/node, 1M L2 and 32M L3) is the best  Performance per unit area: the configuration (4 cores/node, 512K L2 and 8M L3) is the best  Performance³ per unit area: (4 cores/node, 512K to 1M L2, 8M to 16M L3)
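The three metrics above weight raw performance differently against area cost, which is why they pick different winners. A small sketch of how such a ranking flips as the performance exponent grows; the perf and area numbers here are made up for illustration and only the ordering matters:

```python
# Sketch: rank cache configurations by performance^k / area.
# The perf and area_mm2 values are illustrative, not measured results.

def rank(configs, k):
    """Sort configurations by perf^k / area, best first."""
    return sorted(configs, key=lambda c: c["perf"] ** k / c["area_mm2"],
                  reverse=True)

configs = [
    {"name": "4c/node, 1M L2, 32M L3",  "perf": 1.0, "area_mm2": 260},
    {"name": "4c/node, 512K L2, 8M L3", "perf": 0.8, "area_mm2": 160},
]
print(rank(configs, 1)[0]["name"])  # perf/area favors the smaller config
print(rank(configs, 3)[0]["name"])  # perf^3/area favors the faster config
```

Raising the exponent from 1 to 3 shifts the preference from the area-efficient option toward the highest-performing one, mirroring how the perf³/area metric settles on the middle ground of 512K to 1M L2 with 8M to 16M L3.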

16 Performance Evaluation (SAP, SPECjbb)

17 Implications and Inferences  Design a 3-level cache hierarchy  Each node consists of four cores with 512K to 1M of L2 cache  The L3 cache size is recommended to be a minimum of 16M  Recommend that the platform support at least 64 GB/s of memory bandwidth and 512 GB/s of interconnect bandwidth

18 Summary  Performed the first study of the performance, area and bandwidth implications of LCMP cache design  Introduced a constraint-aware analysis methodology to explore LCMP cache design options  Applied this methodology to a 32-core LCMP architecture  Quickly narrowed down the design space to a small subset of viable options  Conducted an in-depth performance/area evaluation of these options and summarized a set of recommendations for architecting efficient LCMP platforms

