Energy-Efficient Address Translation Vasileios Karakostas, Jayneel Gandhi, Adrian Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman S. Unsal
Executive Summary. Problem: TLBs consume energy, especially on hits. Energy-Efficient Address Translation. Insight: increased TLB reach reduces TLB pressure. Lite mechanism: selectively resize L1 TLB resources to save energy. TLBLite design: commodity processors with huge pages. RMMLite design: RMM with support for range translations [ISCA ’15]. Results: 23% to 71% reduction in address translation energy.
Outline: Motivation + Opportunity; Energy-Efficient Address Translation; Results
Virtual Memory is not free. Performance overhead due to page walks (previous work). Energy overhead due to TLB lookups: on every memory operation; > 90% in L1 TLBs. This work: leverage TLB reach to improve energy efficiency. [Figure label: TLBs] *Sodani’s/Intel keynote at MICRO 2011
Base 4KB Pages. Per-core TLB hierarchy; focus on data TLBs. [Diagram: Core -> L1 TLB (hit or miss) -> L2 TLB (miss) -> Page Table Walker (page walk)]
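The lookup path in this diagram can be sketched in a few lines of Python. This is a minimal illustration, not the simulator used in the talk: the names `translate`, `l1_tlb`, `l2_tlb`, and `page_table` are hypothetical, and the TLBs are modeled as plain dictionaries with no capacity limit or replacement policy.

```python
PAGE_SHIFT = 12  # 4KB base pages


def translate(vaddr, l1_tlb, l2_tlb, page_table):
    """Return (physical address, which level served the translation).

    l1_tlb, l2_tlb, and page_table are dicts mapping a virtual page
    number to a physical frame number (an assumption of this sketch).
    """
    vpn = vaddr >> PAGE_SHIFT
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    if vpn in l1_tlb:
        level = "L1 TLB hit"
    elif vpn in l2_tlb:
        l1_tlb[vpn] = l2_tlb[vpn]  # refill L1 from L2
        level = "L2 TLB hit"
    else:
        pfn = page_table[vpn]      # page walk on a miss in both TLBs
        l2_tlb[vpn] = pfn
        l1_tlb[vpn] = pfn
        level = "page walk"
    return (l1_tlb[vpn] << PAGE_SHIFT) | offset, level
```

Every call touches the L1 TLB, which is why L1 lookups dominate the energy numbers even when they hit.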
Base 4KB Pages. Performance overhead: up to 50% in page walks. Energy overhead: 60% in L1 TLB accesses, 35% in page walks. [Diagram: Core, L1 TLB, L2 TLB, Page Table Walker]
Huge Pages. [Diagram: Core; L1 TLB split into 4KB TLB and 2MB TLB; L2 TLB; Page Table Walker]
Huge Pages. Performance improves, but energy increases by 4% (up to 43% increase). Separate L1 TLBs: 91% of energy in L1 TLBs. [Diagram: Core, 4KB TLB, 2MB TLB, L2 TLB, Page Table Walker]
Redundant Memory Mappings [ISCA ’15]. Range translations: arbitrarily large mappings from contiguous virtual pages to contiguous physical pages with uniform protection. [Diagram: Virtual Memory mapped to Physical Memory]
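A range translation can be pictured as a (base, limit, offset) triple: any virtual address inside the range translates by adding one offset. The sketch below assumes that representation; the names `RangeTranslation` and `range_lookup` are hypothetical, protection bits are omitted, and the linear scan stands in for what would be a parallel compare in a fully associative range TLB.

```python
from typing import NamedTuple, Optional, Sequence


class RangeTranslation(NamedTuple):
    base: int    # first virtual address covered (inclusive)
    limit: int   # end of the virtual range (exclusive)
    offset: int  # physical address = virtual address + offset


def range_lookup(vaddr: int, ranges: Sequence[RangeTranslation]) -> Optional[int]:
    """Return the physical address if some range covers vaddr, else None."""
    for r in ranges:
        if r.base <= vaddr < r.limit:
            return vaddr + r.offset
    return None


# Hypothetical example: one 4MB range mapped to physical address 0x7000000.
example = [RangeTranslation(0x100000, 0x500000, 0x7000000 - 0x100000)]
```

One such entry covers the whole 4MB region, where 4KB pages would need 1024 separate entries.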
Redundant Memory Mappings [ISCA ’15]. [Diagram: Core; L1: 4KB TLB, 2MB TLB; L2: L2 TLB, L2 range TLB]
Redundant Memory Mappings [ISCA ’15]. Performance improves a lot: increases L2 TLB reach; eliminates page walks. Energy still high: the “innocent” L1 TLB hits! 98% of energy in L1 TLBs. [Diagram: Core, 4KB TLB, 2MB TLB, L2 TLB, L2 range TLB]
Our goal: improved performance and reduced energy. [Chart: state of the art (4KB Pages, Huge Pages, RMM) compared on Performance and Dynamic Energy]
Key Observation. Can we leverage increased TLB reach to save energy? Yes! Larger reach: a single 2MB entry == 512 x 4KB entries, which naturally takes pressure off the L1 TLBs. Why access all L1 TLB entries, esp. for 4KB pages? Look up fewer entries: reduce dynamic energy with similar performance. [Diagram: Core, 4KB TLB, 2MB TLB, L2 TLB]
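The reach arithmetic behind this observation is easy to check. The helper `tlb_reach` is a hypothetical name for this sketch; the entry counts (a 64-entry 4KB L1 TLB and a 32-entry 2MB L1 TLB) are the ones quoted for the evaluated configurations in the results slides.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size):
    """Bytes of virtual address space covered when every entry is valid."""
    return entries * page_size


# One 2MB entry covers as much memory as this many 4KB entries.
entries_equivalent = (2 * MB) // (4 * KB)

reach_4k = tlb_reach(64, 4 * KB)  # 64-entry 4KB L1 TLB
reach_2m = tlb_reach(32, 2 * MB)  # 32-entry 2MB L1 TLB
```

Under these assumptions the 2MB TLB reaches 64MB against 256KB for the 4KB TLB, so when huge pages back most of the working set, few lookups need the 4KB TLB's full capacity.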
Outline: Motivation + Opportunity; Energy-Efficient Address Translation (the Lite mechanism, TLBLite design for huge pages, RMMLite design for range translations); Results
The Lite Mechanism. Goal: save TLB energy with similar performance. Utility-based monitoring [Dropsho et al. PACT ’02, Qureshi et al. MICRO ’06]: distance of TLB hits from the MRU position; inferring utility of active ways. Way disabling.
The Lite Mechanism. L1 TLB distance counters track the distance of hits to monitor utility: C0 counts hits at distance 0, C1 at distance 1, C2 at distance 2-3. Example state (MRU to LRU): SET 0 = page 10, page 30, page 90, page 50; SET 1 = page 35, page 25, page 55, page 75. A hit on page 35 (MRU position of SET 1) has distance == 0, so C0 becomes 1.
The Lite Mechanism. A hit on page 30 (second position of SET 0) has distance == 1, so C1 becomes 1. Counters now: C0 = 1, C1 = 1; C2 has no hits yet. State shown (MRU to LRU): SET 0 = page 10, page 30, page 90, page 50; SET 1 = page 35, page 25, page 55, page 75.
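The two hits in this walkthrough can be reproduced with a small model of a 4-way LRU TLB that records hit distances. This is a sketch under stated assumptions, not the hardware design: the class name `MonitoredTLB`, the `page % number_of_sets` indexing, and the list-based LRU ordering are all hypothetical choices made for illustration.

```python
class MonitoredTLB:
    """4-way set-associative TLB with LRU order and hit-distance counters."""

    def __init__(self, sets):
        # Each set is a list of page numbers ordered MRU -> LRU.
        self.sets = [list(s) for s in sets]
        self.num_ways = len(self.sets[0])
        self.counters = [0, 0, 0]  # C0: distance 0, C1: distance 1, C2: distance 2-3

    def access(self, page):
        """Look up a page; return its hit distance, or None on a miss."""
        ways = self.sets[page % len(self.sets)]  # assumed set indexing
        if page not in ways:
            if len(ways) >= self.num_ways:
                ways.pop()                       # evict the LRU entry
            ways.insert(0, page)
            return None
        distance = ways.index(page)              # 0 means MRU position
        self.counters[min(distance, 2)] += 1     # distances 2 and 3 share C2
        ways.remove(page)
        ways.insert(0, page)                     # promote to MRU
        return distance


tlb = MonitoredTLB([[10, 30, 90, 50], [35, 25, 55, 75]])
```

Calling `tlb.access(35)` and then `tlb.access(30)` yields distances 0 and 1, leaving the counters at C0 = 1, C1 = 1, as on the slides.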
The Lite Mechanism. After many TLB accesses (interval ends): C0 = 95, C1 = 46, C2 = 2. Ways 2-3 are less useful, so disable them [Albonesi, MICRO ’99].
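One way to turn the interval counters into a resizing decision is a simple threshold on the hits that would be lost. The policy below is an assumption made for illustration, not the rule from the paper: `choose_active_ways` and `max_lost_fraction` are hypothetical names, and the sketch relies on LRU ordering, under which a hit at distance d would have missed with d or fewer active ways.

```python
def choose_active_ways(counters, max_lost_fraction=0.05):
    """Pick 1, 2, or 4 active ways for a 4-way TLB from [C0, C1, C2].

    max_lost_fraction is a hypothetical tuning knob: the largest share of
    this interval's hits we accept turning into misses.
    """
    c0, c1, c2 = counters
    total = c0 + c1 + c2
    if total == 0:
        return 4  # no hits observed, so keep every way enabled
    if (c1 + c2) / total <= max_lost_fraction:
        return 1  # nearly all hits were at the MRU position
    if c2 / total <= max_lost_fraction:
        return 2  # ways 2-3 contributed few hits
    return 4


# Counter values read from the slide: C0 = 95, C1 = 46, C2 = 2.
ways_after_interval = choose_active_ways([95, 46, 2])
```

With the slide's counters, only 2 of 143 hits landed in ways 2-3, so this sketch keeps two ways active, matching the slide's decision to disable ways 2-3.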
The Lite Mechanism. Save energy on every TLB lookup: a lookup (e.g., page XY) accesses only the active ways. [Diagram: SET 0 and SET 1 with ways ordered MRU to LRU; counters C0, C1, C2]
TLBLite design. [Diagram: Core; L1: 4KB TLB and 2MB TLB, each with Lite; L2 TLB]
RMMLite design. High hit ratio: fewer L1 TLB misses; disables more ways. [Diagram: Core; L1: 4KB TLB, L1 range TLB, 2MB TLB, Lite; L2: L2 TLB, L2 range TLB]
RMMLite design. L1 range TLB: small but efficient (4 entries); arbitrarily large mappings. [Diagram: Core; L1: 4KB TLB, L1 range TLB, Lite; L2: L2 TLB, L2 range TLB; Virtual Memory mapped to Physical Memory]
Outline: Motivation + Opportunity; Energy-Efficient Address Translation; Results
Methodology. Developed MMU simulator: Pin, CACTI, and profiling with Linux pagemap. Baseline: Intel Sandy Bridge. Dynamic energy & performance models for the address translation path. TLB-intensive workloads: Spec2006, BioBench, and PARSEC.
Miss Cycles: cycles spent in address translation. Energy: dynamic energy spent in address translation. Geometric mean (detailed results in paper).
4KB. [Chart axes: Miss Cycles; Energy (normalized to)] Configuration: L1 TLB, 64 entries, 4-way; L2 TLB, 512 entries, 4-way. High performance overhead: page walks. High energy overheads: 60% in L1 TLB, 35% in page walks.
2MB. [Chart axes: Miss Cycles; Energy] Configuration: 4KB configuration + L1 2MB TLB (32 entries, 4-way). Performance improves, but energy increases: 4% on average and up to 43%. Separate L1 TLBs.
TLBLite. [Chart axes: Miss Cycles; Energy] Configuration: 2MB configuration + Lite. Similar performance to 2MB. Reduces energy by 23%. 49% of lookups with fewer than 4 ways in L1 4KB TLB.
RMM [ISCA ’15]. [Chart axes: Miss Cycles; Energy] Configuration: 2MB configuration + L2 range TLB (32 entries, fully assoc.). Even better performance, but energy is still high: similar to 2MB pages; L1 TLBs.
RMMLite. [Chart axes: Miss Cycles; Energy] Configuration: RMM configuration + L1 range TLB (4 entries, fully assoc.) + Lite. Better performance vs. RMM: fewer L1 TLB misses. Reduces dynamic energy by 71%: 84% of hits from L1 range TLB; 63% of lookups with 1 way in 4KB.
Summary. Observation: increased TLB reach reduces TLB pressure. [Chart: 4KB pages, Huge, RMM, TLBLite, RMMLite compared on Performance and Dynamic Energy; TLBLite 23%, RMMLite 71%]
Thank you!
BACKUP SLIDES
Summary. Problem: TLBs consume energy, especially on hits. Energy-Efficient Address Translation. Insight: increased TLB reach reduces TLB pressure. Lite mechanism: selectively resize L1 TLB resources to save energy. TLBLite design: commodity processors with huge pages. RMMLite design: RMM with support for range translations [ISCA ’15]. Results: 23% to 71% reduction in address translation energy.
Related Work. Optimizing TLBs for energy efficiency: circuit techniques [ISLPED ’97]; partitioning TLBs [ISLPED ’03, CASES ’06, ISLPED ’13]; filtering TLB requests [ISLPED ’05, TVLSI ’07]. Dynamically resizing TLB [MICRO ’00]: single reference bit per TLB entry; targets monolithic TLB. Selective TLB lookups [MICRO ’02, ISPASS ’04, CODES ’04]: compiler support and special registers.
Related Work. TLB-Pred [HPCA ’15]: all page sizes in a single set-associative TLB; see paper for comparison. Virtual caches: defer address translation until cache miss; increase hardware complexity (synonyms, protection).
Miss Cycles Overhead: TLB Intensive Workloads
Energy Overhead: TLB Intensive Workloads
Miss Cycles and Energy Overheads: Spec2006. TLBLite and RMMLite reduce the dynamic energy by 26% and 72%, respectively.
Miss Cycles and Energy Overheads: PARSEC. TLBLite and RMMLite reduce the dynamic energy by 20% and 66%, respectively.
Percentage of L1 TLB lookups with active ways
            TLBLite                    RMMLite
L1 TLB      4-ways  2-ways  1-way      4-ways  2-ways  1-way
4KB         51.2%   32.9%   15.9%      25.9%   10.4%   63.7%
2MB         81.1%   9.0%    9.9%       -----
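A quick arithmetic check ties these percentages to the headline numbers in the results slides. The values are copied from this slide, read as (4-ways, 2-ways, 1-way) per TLB; the dictionary layout and the helper names are hypothetical.

```python
# Percentage of L1 TLB lookups by number of active ways: (4-ways, 2-ways, 1-way).
lookups = {
    ("TLBLite", "4KB"): (51.2, 32.9, 15.9),
    ("TLBLite", "2MB"): (81.1, 9.0, 9.9),
    ("RMMLite", "4KB"): (25.9, 10.4, 63.7),
}


def fewer_than_all_ways(row):
    """Share of lookups performed with 2 ways or 1 way active."""
    _four, two, one = row
    return round(two + one, 1)


def row_total(row):
    """Sum of a row, rounded to one decimal to absorb float noise."""
    return round(sum(row), 1)
```

For TLBLite's 4KB TLB this gives 48.8%, the "49% of lookups with fewer than 4 ways" figure, and the 63.7% single-way share for RMMLite is the "63% of lookups with 1 way in 4KB" figure.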
Distribution of L1 TLB hits TLBLite RMMLite 4KB 2MB Range 15.9% 35.6% 84.1%
Comparison with TLB-Pred [HPCA ’15]. Same as 2MB conf., with perfect prediction; both L1 and L2 TLB hold 4KB and 2MB pages. Improved performance; energy reduces vs. 2MB conf. Still, RMMLite is more energy-efficient. Orthogonal to range translations.
Impact of Eager Paging