Synonymous Address Compaction for Energy Reduction in Data TLB
Chinnakrishnan Ballapuram, Hsien-Hsin S. Lee, Milos Prvulovic
School of Electrical and Computer Engineering / College of Computing
Georgia Institute of Technology, Atlanta, GA 30332

Background
- Address translation is a major contributor to processor power
- The I-TLB and D-TLB are looked up for every instruction fetch and memory reference
- TLBs are highly associative
- Multi-porting further increases power consumption

Outline
- Motivation: unique access behavior and locality are analyzed for energy-reduction opportunities
- Synonymous Address Compaction
  - Intra-Cycle Compaction
  - Inter-Cycle Compaction
- Implementation Details
- Performance/Energy Evaluation
- Conclusions

Breakdown of d-TLB Accesses
- For 58% of data TLB accesses, more than one d-TLB lookup is issued in the same cycle (4-wide machine)
- These concurrent lookups often access the same page (intra-cycle synonymous accesses)
(Chart: breakdown as % of data TLB accesses)
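For concreteness, a minimal sketch of how intra-cycle synonyms collapse into fewer lookups, assuming 32-bit virtual addresses and 4 KB pages (so the VPN is the upper 20 bits); the mechanism itself is hardware, this is only an illustration:

```python
PAGE_SHIFT = 12  # 4 KB pages: the VPN is the upper 20 bits of a 32-bit address

def vpn(addr):
    """Virtual page number of a virtual address."""
    return addr >> PAGE_SHIFT

def lookups(cycle_addrs):
    """(baseline, compacted) d-TLB lookups for one cycle's accesses."""
    vpns = [vpn(a) for a in cycle_addrs]
    return len(vpns), len(set(vpns))

# Three same-page accesses in one cycle collapse to a single lookup.
print(lookups([0xdeadbeee, 0xdeadbeef, 0xdeadbef0]))  # (3, 1)
```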

Breakdown of Synonymous Intra-Cycle Accesses in d-TLB
- ~30% of accesses have synonyms, i.e., redundant lookups
- With intra-cycle compaction, 1/2 of syn(1) accesses, 2/3 of syn(2) accesses, and 3/4 of syn(3) accesses can be eliminated: a syn(k) group of k+1 same-page accesses needs only one lookup, so k/(k+1) of its lookups are redundant
(Chart: breakdown as % of data TLB accesses)

Inter-Cycle Reuse of d-TLB Translations
- Inter-cycle synonymous accesses: 68% of accesses could reuse the last address translation
- More reuse can be achieved by partitioning the d-TLB into stack (99%), global (82%), and heap (75%) TLBs
(Chart: % of data TLB accesses)
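A sketch of how such reuse numbers could be measured from an address trace (4 KB pages assumed; the paper's exact methodology may differ):

```python
PAGE_SHIFT = 12  # assuming 4 KB pages

def last_translation_reuse(addrs):
    """Fraction of accesses whose page matches the previous access's page."""
    vpns = [a >> PAGE_SHIFT for a in addrs]
    hits = sum(prev == cur for prev, cur in zip(vpns, vpns[1:]))
    return hits / max(len(vpns) - 1, 1)

print(last_translation_reuse([0xdeadbeee, 0xdeadbeef, 0xffffffff]))  # 0.5
```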

Dynamic Data Memory Distribution
- ~40% of dynamic memory accesses go to the stack, which is concentrated on only a few pages
- 4 memory accesses ~= 2 stack + 1 global + 1 heap

Semantic-Aware Memory Architecture
(Diagram: the virtual address enters a data address router that uses ld_data_base_reg, ld_data_bound_reg, and ld_env_base_reg to steer each access to the sTLB/sCache (stack), gTLB/gCache (global), or uTLB/hCache (heap); all are backed by a unified L2 cache.)
- Most memory accesses go to the smaller stack and global TLBs/caches, reducing power
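A hypothetical sketch of the router's classification. The slides name the three registers but not the exact range checks, so this assumes a conventional layout (globals between the data base and bound, stack just below the environment base, everything else heap); STACK_REGION is an assumed stack window, not a value from the paper:

```python
STACK_REGION = 8 * 1024 * 1024  # assumed 8 MB stack window (hypothetical)

def route(vaddr, ld_data_base_reg, ld_data_bound_reg, ld_env_base_reg):
    """Steer one virtual address to a semantic TLB/cache pair."""
    if ld_data_base_reg <= vaddr < ld_data_bound_reg:
        return "gTLB/gCache"                      # global/static data
    if ld_env_base_reg - STACK_REGION <= vaddr <= ld_env_base_reg:
        return "sTLB/sCache"                      # stack
    return "uTLB/hCache"                          # heap and everything else

# A high address near the environment base routes to the stack TLB/cache.
print(route(0x7fffe000, 0x08048000, 0x080a0000, 0x80000000))  # sTLB/sCache
```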

VPN Compaction Mechanisms
Virtual address access sequence:
  Cycle i:     0xdeadbeee  0xdeadbeef  0xdeadbef0
  Cycle (i+1): 0xdeadbef2  0xdeadbeef  0xffffffff
VPN translation lookups in d-TLB:
  Cycle i:     0xdeadb  0xdeadb  0xdeadb
  Cycle (i+1): 0xdeadb  0xdeadb  0xfffff

VPN Compaction Mechanisms (cont.)
Intra-cycle compaction merges same-page lookups issued in the same cycle.
VPNs after intra-cycle compaction:
  Cycle i:     0xdeadb
  Cycle (i+1): 0xdeadb  0xfffff

VPN Compaction Mechanisms (cont.)
Inter-cycle compaction additionally reuses the previous cycle's translation.
VPNs after inter-cycle compaction:
  Cycle i:     0xdeadb
  Cycle (i+1): 0xfffff  (0xdeadb reused from cycle i)
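Putting the two mechanisms together on the slide's example sequence, a minimal sketch (4 KB pages assumed; the hardware uses latches and comparators, not lists, and the single-latch update here is a simplification):

```python
PAGE_SHIFT = 12  # assuming 4 KB pages on 32-bit virtual addresses

def compact(cycles):
    """Return the d-TLB lookups actually issued in each cycle."""
    last_vpn = None                  # inter-cycle: last translated VPN
    issued = []
    for addrs in cycles:
        vpns = []
        for a in addrs:
            v = a >> PAGE_SHIFT
            # intra-cycle: drop same-page duplicates within the cycle;
            # inter-cycle: drop a VPN already held by the MRU latch
            if v != last_vpn and v not in vpns:
                vpns.append(v)
        if vpns:
            last_vpn = vpns[-1]      # simplified single-latch update
        issued.append(vpns)
    return issued

seq = [[0xdeadbeee, 0xdeadbeef, 0xdeadbef0],   # cycle i
       [0xdeadbef2, 0xdeadbeef, 0xffffffff]]   # cycle (i+1)
print([[hex(v) for v in c] for c in compact(seq)])
# [['0xdeadb'], ['0xfffff']] -- matches the slide's compacted lookups
```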

Intra-Cycle Compaction Mechanism
(Diagram: AGUs, IUs, and FPUs feed the reservation station; load buffer and store buffer entries pass through six 20-bit comparators before the 32-entry fully-associative data TLB; physical addresses proceed to the memory order buffer.)
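The count of six comparators follows from the machine width: a 4-wide machine issues at most four data-memory accesses per cycle, giving C(4,2) = 6 VPN pairs to compare. A small sketch of that pairwise check:

```python
from itertools import combinations

def synonym_matrix(vpns):
    """Pairwise equality of up to four 20-bit VPNs (one comparator per pair)."""
    assert len(vpns) <= 4
    return {(i, j): vpns[i] == vpns[j]
            for i, j in combinations(range(len(vpns)), 2)}

matches = synonym_matrix([0xdeadb, 0xdeadb, 0xdeadb, 0xfffff])
print(sum(matches.values()), "of", len(matches), "pairs match")  # 3 of 6
```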

Comparator Logic
(Figure: comparator circuit; not reproduced in the transcript.)

Inter-Cycle Compaction Mechanism
(Diagram: the semantic-aware architecture of slide 8, with an MRU latch added in front of each TLB so that an access whose VPN matches the latched last translation reuses it instead of performing a lookup.)
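A minimal software sketch of the MRU latch's behavior; in hardware this is a single register and comparator per semantic TLB, not a class:

```python
class MRULatchedTLB:
    """TLB front-end with an MRU latch holding the last translation."""

    def __init__(self, tlb_lookup):
        self.tlb_lookup = tlb_lookup   # function: VPN -> physical frame
        self.last_vpn = None
        self.last_frame = None
        self.lookups = 0               # count of real TLB activations

    def translate(self, vpn):
        if vpn == self.last_vpn:       # latch hit: no TLB lookup energy
            return self.last_frame
        self.lookups += 1
        self.last_vpn = vpn
        self.last_frame = self.tlb_lookup(vpn)
        return self.last_frame

# Usage: the second same-page access hits the latch, not the TLB.
tlb = MRULatchedTLB(lambda vpn: vpn ^ 0x12345)  # toy translation function
tlb.translate(0xdeadb)
tlb.translate(0xdeadb)
print(tlb.lookups)  # 1
```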

Simulation Parameters
  Execution engine:                  Out-of-order
  Fetch / Decode / Issue / Commit:   4 / 4 / 4 / 4
  L1 / L2 / Memory latency:          1 / 6 / 150 cycles
  TLB hit / miss latency:            1 / 30 cycles
  L1 cache (baseline):               direct-mapped, 32 KB, 32 B lines
  L2 cache:                          4-way, 512 KB, 32 B lines
  Number of TLB entries:             32
  Power per 20-bit comparator:       300 uW
  Power per MRU latch in TLB:        140 uW

Energy Savings via Synonymous Compaction
- Intra-cycle compaction: 27%
- Inter-cycle compaction: 42%
- Inter-cycle semantic-aware: 56%
(Chart: d-TLB energy savings, %)

Performance Impact w/ Synonymous Compaction
- Intra-cycle compaction: 9%
- Inter-cycle compaction: 8%
- Inter-cycle semantic-aware: 4%
(Chart: performance speedup)

I- and d-TLB Energy Savings via Synonymous Compaction
- Combining compaction for the i-TLB and d-TLB gives 85% and 52% energy savings, respectively
- Overall, 70% TLB energy savings
- With semantic-aware partitioning, overall 76% energy savings
(Chart: TLB energy savings, %)
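As a rough consistency check, assuming (the slides do not state this) comparable baseline energy in the i-TLB and d-TLB, the overall figure is close to the mean of the per-TLB savings:

```python
# Assumed (not stated on the slide): i-TLB and d-TLB have comparable
# baseline energy, so overall savings is roughly their mean.
i_tlb_savings, d_tlb_savings = 0.85, 0.52
overall = (i_tlb_savings + d_tlb_savings) / 2
print(f"{overall:.1%}")  # 68.5% -- in line with the reported 70% overall
```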

I- and d-TLB Performance Impact w/ Synonymous Compaction
- Combining compaction for the i-TLB and d-TLB has a 5% and 13% performance impact, respectively
- With semantic-aware partitioning, overall 13% performance impact
(Chart: performance speedup)

Conclusions
- Consecutive TLB accesses are highly synonymous
- Proposed synonymous address compaction to exploit this behavior, reducing energy for both the d-TLB and i-TLB
- Energy savings and performance impact:
  - Intra-cycle: 27% and 9%
  - Inter-cycle: 42% and 8%
  - Semantic-aware: 56% and 4%

Q and A