Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture
Seongbeom Kim, Dhruba Chandra, and Yan Solihin
Dept. of Electrical and Computer Engineering
North Carolina State University
{skim16, dchandr,

Cache Sharing in CMP

[Figure: two processor cores, each with a private L1 cache, sharing a unified L2 cache; co-scheduled threads t1 and t2 contend for the shared L2 space.]

t2's throughput is significantly reduced due to unfair cache sharing.

Shared L2 cache space contention

[Figure: example of co-scheduled threads contending for shared L2 cache space.]

Impact of unfair cache sharing

[Figure: time-slice schedules of threads t1-t4 under uniprocessor scheduling vs. 2-core CMP scheduling on processors P1 and P2.]

Problems of unfair cache sharing:
– Sub-optimal throughput
– Thread starvation
– Priority inversion
– Thread-mix dependent throughput
Fairness: uniform slowdown for co-scheduled threads.

Contributions

Cache fairness metrics
– Easy to measure
– Approximate uniform slowdown well
Fair caching algorithms
– Static/dynamic cache partitioning optimizing fairness
– Simple hardware modifications
Simulation results
– Fairness: 4x improvement
– Throughput: 15% improvement, comparable to the cache miss minimization approach

Related Work

Cache miss minimization in CMP:
– G. Suh, S. Devadas, L. Rudolph, HPCA 2002
Balancing throughput and fairness in SMT:
– K. Luo, J. Gummaraju, M. Franklin, ISPASS 2001
– A. Snavely and D. Tullsen, ASPLOS 2000
– …

Outline

Fairness Metrics
Static Fair Caching Algorithms (See Paper)
Dynamic Fair Caching Algorithms
Evaluation Environment
Evaluation
Conclusions

Fairness Metrics

Uniform slowdown: fair caching should slow down co-scheduled threads by the same factor relative to running alone. Let $T_i^{alone}$ be the execution time of thread $t_i$ when it runs alone, and $T_i^{shared}$ its execution time when it shares the cache with others. Uniform slowdown means

$\frac{T_i^{shared}}{T_i^{alone}} = \frac{T_j^{shared}}{T_j^{alone}}$ for all co-scheduled threads $t_i$, $t_j$.

We want to minimize:

$M_0 = \sum_i \sum_{j>i} \left| X_i - X_j \right|$, where $X_i = \frac{T_i^{shared}}{T_i^{alone}}$

– Ideally: $M_0 = 0$ (perfectly uniform slowdown).

Because execution times are hard to track at run time, the proxy metrics substitute miss statistics for $X_i$: M1 uses the ratio of shared to alone miss counts, and M3 the ratio of shared to alone miss rates (a small computation example follows).

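A minimal sketch (not from the paper) of how such a pairwise-difference metric could be computed; fairness_metric and xs are hypothetical names:

    #include <math.h>

    /* Pairwise-difference fairness metric: xs[i] is thread i's
     * slowdown proxy, e.g. X_i = T_shared/T_alone for M0, or
     * X_i = missrate_shared/missrate_alone for M3.
     * A result of 0 means perfectly uniform slowdown. */
    double fairness_metric(const double *xs, int n)
    {
        double m = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                m += fabs(xs[i] - xs[j]);
        return m;
    }

For the dynamic example later in the talk, X_P1 = 20%/20% = 1.0 and X_P2 = 15%/5% = 3.0, so M3 = |1.0 - 3.0| = 2.0.
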
Outline

Fairness Metrics
Static Fair Caching Algorithms (See Paper)
Dynamic Fair Caching Algorithms
Evaluation Environment
Evaluation
Conclusions

Partitionable Cache Hardware

Modified LRU cache replacement policy [G. Suh et al., HPCA 2002] (sketched below):
– Per-thread counters track the current partition (how much cache each thread occupies) against a target partition.
– Example: the current partition is P1: 448B, P2: 576B while the target partition is P1: 384B, P2: 640B. On a P2 miss, the policy evicts P1's LRU line because P1 exceeds its target; the current partition then becomes P1: 384B, P2: 640B, matching the target.

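A minimal sketch of this victim-selection rule, under the assumption that ways are ordered MRU-to-LRU and that per-thread occupancy counters are available; all names (pick_victim_way, way_owner, ...) are hypothetical:

    /* On a miss by thread `req`, evict from a thread that occupies
     * more cache than its target partition allows; otherwise fall
     * back to plain LRU. */
    int pick_victim_way(int req, const long *current, const long *target,
                        int nthreads, const int *way_owner, int ways)
    {
        for (int t = 0; t < nthreads; t++) {
            if (t == req || current[t] <= target[t])
                continue;                        /* not over its target */
            for (int w = ways - 1; w >= 0; w--)  /* ways: MRU..LRU      */
                if (way_owner[w] == t)
                    return w;                    /* LRU line of thread t */
        }
        return ways - 1;                         /* plain LRU victim     */
    }

In the slide's example, P2 misses while P1 holds 448B against a 384B target, so the function returns P1's LRU way.
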
Dynamic Fair Caching Algorithm

Example: optimizing the M3 metric. For each thread the algorithm keeps the profiled MissRate_alone, the MissRate_shared measured during the current interval, and the target partition; it reconsiders the partition at the end of every repartitioning interval.

1st interval: starting from an equal target partition (P1: 256KB, P2: 256KB), the measured MissRate_shared is P1: 20%, P2: 15%; the profiled MissRate_alone is P1: 20%, P2: 5%.

Repartition! Evaluate M3: X_P1 = 20%/20% = 1.0, X_P2 = 15%/5% = 3.0. P2 suffers the larger relative slowdown, so one unit of the 64KB partition granularity moves from P1 to P2; the target partition becomes P1: 192KB, P2: 320KB.

2nd interval: under the new target partition, the measured MissRate_shared improves to P1: 20%, P2: 10%.

Repartition! Evaluate M3: X_P1 = 20%/20% = 1.0, X_P2 = 10%/5% = 2.0. P2 is still slowed down more, so the target partition becomes P1: 128KB, P2: 384KB.

3rd interval: the measured MissRate_shared is now P1: 25%, P2: 9%.

Repartition! Roll back if the repartitioning did not pay off: with Δ = MR_old − MR_new for the thread that received more cache, roll back when Δ < T_rollback. Here P2's miss rate only fell from 10% to 9% (Δ = 1%), so the target partition rolls back to P1: 192KB, P2: 320KB. (A sketch of this repartitioning step appears below.)

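A minimal sketch of one repartitioning step for two threads, assuming the 64KB granularity and the M3 ratios from the slides; repartition, tstat, and the bookkeeping arguments are hypothetical, and MissRate_alone is assumed non-zero:

    #define GRAIN (64 * 1024)   /* repartitioning granularity */

    typedef struct {
        double mr_alone;    /* profiled miss rate, running alone */
        double mr_shared;   /* miss rate measured this interval  */
        long   target;      /* target partition size in bytes    */
    } tstat;

    /* `grew` is the thread that received cache in the previous step
     * (-1 if none) and `mr_old` its miss rate before growing.
     * Returns the thread that grows now, or -1 on rollback/no change. */
    int repartition(tstat th[2], int grew, double mr_old, double t_rollback)
    {
        if (grew >= 0 && mr_old - th[grew].mr_shared < t_rollback) {
            th[grew].target -= GRAIN;          /* rollback: gain too small */
            th[1 - grew].target += GRAIN;
            return -1;
        }
        double x0 = th[0].mr_shared / th[0].mr_alone;   /* M3 ratio X_i */
        double x1 = th[1].mr_shared / th[1].mr_alone;
        if (x0 == x1)
            return -1;                         /* already fair */
        int g = (x0 > x1) ? 0 : 1;             /* thread slowed down more */
        th[g].target += GRAIN;
        th[1 - g].target -= GRAIN;
        return g;
    }
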
Fair Caching Overhead

Partitionable cache hardware
Profiling (sketched below)
– Static profiling for M1, M3
– Dynamic profiling for M1, M3, M4
Storage
– Per-thread registers: miss rate/count for the "alone" case and for the "shared" case
Repartitioning algorithm
– < 100 cycles overhead in a 2-core CMP
– Invoked at every repartitioning interval

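A minimal sketch of the per-thread profiling state and its update, assuming miss rates are derived from counters at interval boundaries; profile_regs, on_l2_access, and end_interval are illustrative names:

    typedef struct {
        unsigned long accesses;   /* L2 accesses this interval         */
        unsigned long misses;     /* L2 misses this interval           */
        double mr_alone;          /* profiled miss rate, "alone" case  */
        double mr_shared;         /* derived at each interval boundary */
    } profile_regs;

    void on_l2_access(profile_regs *r, int is_miss)
    {
        r->accesses++;
        if (is_miss)
            r->misses++;
    }

    void end_interval(profile_regs *r)
    {
        if (r->accesses)
            r->mr_shared = (double)r->misses / (double)r->accesses;
        r->accesses = r->misses = 0;   /* reset for the next interval */
    }
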
Outline

Fairness Metrics
Static Fair Caching Algorithms (See Paper)
Dynamic Fair Caching Algorithms
Evaluation Environment
Evaluation
Conclusions

Evaluation Environment

UIUC's SESC simulator (cycle-accurate CMP)
Cores: 2 cores, each 4-issue dynamic, 3.2GHz
Memory:
– L1 I/D (private): WB, 32KB, 4-way, 64B line, RT: 3 cycles
– L2 unified (shared): WB, 512KB, 8-way, 64B line, RT: 14 cycles
– L2 replacement: LRU or pseudo-LRU
– RT memory latency: 407 cycles

Evaluation Environment

Algorithm parameters:
– Repartitioning granularity: 64KB
– Repartitioning interval: 10K, 20K, 40K, 80K L2 accesses
– T_rollback: 0%, 5%, 10%, 15%, 20%, 25%, 30%
18 benchmark pairs
Algorithms:
– Static: FairM1
– Dynamic: FairM1Dyn, FairM3Dyn, FairM4Dyn

Outline

Fairness Metrics
Static Fair Caching Algorithms (See Paper)
Dynamic Fair Caching Algorithms
Evaluation Environment
Evaluation
– Correlation results
– Static fair caching results
– Dynamic fair caching results
– Impact of rollback threshold
– Impact of repartitioning interval
Conclusions

Correlation Results

[Figure: correlation of the candidate fairness metrics with M0.]

M1 and M3 show the best correlation with M0.

Static Fair Caching Results

[Figure: fairness and throughput of the static algorithms.]

FairM1 achieves throughput comparable to MinMiss, with better fairness.
Opt assures that better fairness is achieved without throughput loss.

Dynamic Fair Caching Results

[Figure: fairness and throughput of the dynamic algorithms.]

FairM1Dyn and FairM3Dyn show the best fairness and throughput.
Improvement in fairness results in throughput gain.
Fair caching sometimes degrades throughput (2 out of 18 benchmark pairs).

Impact of Rollback Threshold in FairM1Dyn

[Figure: fairness and throughput across T_rollback values.]

A T_rollback of 20% shows the best fairness and throughput.

Impact of Repartitioning Interval in FairM1Dyn

[Figure: fairness and throughput across repartitioning intervals.]

An interval of 10K L2 accesses shows the best fairness and throughput.

Outline

Fairness Metrics
Static Fair Caching Algorithms (See Paper)
Dynamic Fair Caching Algorithms
Evaluation Environment
Evaluation
Conclusions

Conclusions

Problems of unfair cache sharing:
– Sub-optimal throughput
– Thread starvation
– Priority inversion
– Thread-mix dependent throughput
Contributions:
– Cache fairness metrics
– Static/dynamic fair caching algorithms
Benefits of fair caching:
– Fairness: 4x improvement
– Throughput: 15% improvement, comparable to the cache miss minimization approach
– Fair caching simplifies scheduler design
– Simple hardware support

Partitioning Histogram

[Figure: histogram of partitions chosen over the run.]

The algorithm mostly oscillates between two partitioning choices.
A T_rollback of 35% can still find a better partition.

Impact of Partition Granularity in FairM1Dyn

A granularity of 64KB shows the best fairness and throughput.

Impact of Initial Partition in FairM1Dyn

Differences across initial partitions are tolerable.
Starting from an equal partition alleviates the local-optimum problem.

Speedup over Batch Scheduling

FairM1Dyn and FairM3Dyn show the best speedup.