Code Transformation for TLB Power Reduction


Code Transformation for TLB Power Reduction
Reiley Jeyapaul, Sandeep Marathe, and Aviral Shrivastava
Compiler Microarchitecture Laboratory, Arizona State University
5/2/2019 http://www.public.asu.edu/~ashriva6

Translation Lookaside Buffer
- Translation table for address translation and page access permissions
- TLB required for memory virtualization
  - Application programmers see a single, almost unlimited memory
  - Page access control, for privacy and security
- TLB accessed on every memory access
  - Translation needs to be performed only on a miss
  - But page access permissions are needed on every access
- TLB is part of multi-processing environments
  - Part of the Memory Management Unit (MMU)

TLB Power Consumption
- TLB typically implemented as a fully associative cache
  - 8-4096 entries
  - High-speed dynamic domino logic circuitry used
- Very frequently accessed: every memory instruction
- TLB can consume 20-25% of cache power [9]
- TLB can have a power density of ~2.7 nW/mm² [16]
  - More than 4 times that of the L1 cache
- Important to reduce TLB power

[9] M. Ekman, P. Stenström, and F. Dahlgren. TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors. In ISLPED ’02, pages 243–246, New York, NY, USA, 2002. ACM Press.
[16] I. Kadayif, A. Sivasubramaniam, M. Kandemir, G. Kandiraju, and G. Chen. Optimizing instruction TLB energy using software and hardware techniques. ACM Trans. Des. Autom. Electron. Syst., 10(2):229–257, 2005.

Related Work
- Hardware approaches
  - Banked associative TLB
  - 2-level TLB
  - Use-last TLB
- Software approaches
  - Semantic-aware multi-lateral partitioning
  - Translation Registers (TRs) to store the most frequently used TLB translations
  - Compiler-directed code restructuring to optimize the use of TRs
- No hardware-software cooperative approach so far

Use-Last TLB Architecture
- The word line ("WL") is not enabled if the immediately previous tag and the current tag (page address) are the same
- Achieves 75% power savings in the I-TLB
- Deemed ineffective for the D-TLB, due to low page locality
- Need to improve program page locality
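The use-last behavior can be sketched as a small model (a hypothetical sketch of the policy, not the circuit): the associative lookup fires only when the incoming page tag differs from the previous one, so a run of accesses to the same page costs a single lookup.

```python
def use_last_lookups(page_tags):
    """Count accesses that actually enable the word line (WL), i.e. trigger
    a full associative lookup, under the use-last scheme: the lookup is
    skipped whenever the current page tag equals the previous one."""
    lookups, last = 0, None
    for tag in page_tags:
        if tag != last:  # tag changed -> WL enabled, lookup energy spent
            lookups += 1
            last = tag
    return lookups

print(use_last_lookups([5, 5, 5, 5]))  # high page locality: 1 lookup
print(use_last_lookups([5, 7, 5, 7]))  # alternating pages: 4 lookups
```

This is exactly why D-TLB savings hinge on page locality: an access stream that alternates between pages defeats the scheme entirely, which motivates the compiler transformations that follow.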

Code Generation and TLB Page Switches

for (i=1; i < N; i++)
  for (j=1; j < N; j++)
    prediction = 2 * A[i-1][j-1] + A[i-1][j] + A[i][j-1];
    A[i][j] = A[i][j] - prediction;

Assume ArraySize(A) > Page_Size, so A[i-1][j] and A[i][j-1] access different pages: A[i][j] and A[i][j-1] lie on page 1, while A[i-1][j] and A[i-1][j-1] lie on page 2.

High page-switch solution (# page switches = 4):
  T1 = A[i][j] - 2*A[i-1][j-1];
  T2 = A[i][j-1] + A[i-1][j];
  A[i][j] = T1 - T2;

Low page-switch solution (# page switches = 1):
  T1 = 2*A[i-1][j-1] + A[i-1][j];
  T2 = A[i][j] - A[i][j-1];
  A[i][j] = T2 - T1;
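The page-switch counts of the two schedules can be verified with a short script. The page assignment below is an assumption matching the example (row i of A on page 1, row i-1 on page 2), and the access order within each expression is left-to-right with the store last.

```python
# Hypothetical page map for the example: row i on page 1, row i-1 on page 2.
PAGE = {"A[i][j]": 1, "A[i][j-1]": 1, "A[i-1][j]": 2, "A[i-1][j-1]": 2}

def page_switches(accesses):
    """Count transitions between consecutive accesses on different pages."""
    pages = [PAGE[a] for a in accesses]
    return sum(1 for prev, cur in zip(pages, pages[1:]) if prev != cur)

# High page-switch schedule: T1, T2, store -> pages 1,2,1,2,1
high = ["A[i][j]", "A[i-1][j-1]", "A[i][j-1]", "A[i-1][j]", "A[i][j]"]
# Low page-switch schedule: T1, T2, store -> pages 2,2,1,1,1
low = ["A[i-1][j-1]", "A[i-1][j]", "A[i][j]", "A[i][j-1]", "A[i][j]"]

print(page_switches(high))  # 4
print(page_switches(low))   # 1
```

Both schedules compute the same value; only the operand access order differs, which is the whole point of the transformation.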

Outline
- Motivation for TLB power reduction
  - Use-last TLB architecture
  - Intuition for compiler techniques for TLB power reduction
- Compiler techniques
  - Instruction scheduling
    - Problem formulation
    - Heuristic solution
  - Array interleaving
  - Loop unrolling
- Comprehensive solution
- Summary

Page Switching Model
- Represent an instruction i by the tuple (d, s1, s2)
  - d: destination operand, s1: first source operand, s2: second source operand
- When an instruction executes, assume its operands are accessed in the order i.s1, i.s2, i.d
- Need to estimate the number of page switches for a sequence of instructions:
  PS(p, i1, i2, ..., in) = PS(p, i1.s1, i1.s2, i1.d, i2.s1, i2.s2, i2.d, ..., in.s1, in.s2, in.d)
- Page mapping
  - Scalars: undef
  - Globals: p1
  - Local arrays
    - Different arrays map to different pages
    - Find the dimension such that the size of the array in the lower dimensions > page size
    - Any difference in a higher-dimension index means a different page
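The PS(...) estimate above can be written down directly. This is a minimal sketch assuming pages are represented by opaque ids and scalar operands (undef pages, encoded as None) never cause a switch.

```python
def ps(p, instructions):
    """Estimate page switches for a straight-line instruction sequence.

    p: the page currently held in the TLB's use-last register.
    instructions: list of (d, s1, s2) tuples of page ids; None models a
    scalar/undef operand. Operands are touched in the order s1, s2, d,
    as assumed by the model.
    """
    switches = 0
    for d, s1, s2 in instructions:
        for operand in (s1, s2, d):
            if operand is not None and operand != p:
                switches += 1  # page changed: the TLB does a full lookup
                p = operand
    return switches

# One instruction reading page 2 twice and writing page 1, starting on page 1:
print(ps(1, [(1, 2, 2)]))  # 2 switches (1 -> 2 on s1, 2 -> 1 on d)
```

With this counter, comparing two candidate schedules reduces to comparing ps(p, schedule) values.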

Problem Formulation
- Build a graph with a source node, a sink node, and one node per instruction
- Two kinds of edges: data dependence edges and page-switch edges
  - The weight of a page-switch edge is the number of page switches incurred when node i is scheduled immediately after node j
- Finding the instruction schedule with the minimum number of page switches = finding the shortest Hamiltonian path from source to sink
[Figure: example graph with numbered instruction nodes and weighted page-switch edges omitted]

Heuristic Solution
- Greedy solution: pick the source of a Page-Non-Switching Edge (PNSE) with priority
  - After scheduling (1), we can pick (2) or (3)
  - Picking (3) is a bad idea: we lose the opportunity to reduce page switches
- Our solution: pick PNSE edges greedily
[Figure: example DAG with data dependence edges and PNSEs omitted]
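A minimal sketch of this greedy heuristic (names and tie-breaking are my assumptions, not the paper's exact algorithm): list-schedule the dependence DAG, preferring a ready node joined to the last scheduled node by a PNSE, and otherwise preferring PNSE endpoints so their partners remain schedulable next to them.

```python
def greedy_schedule(n, deps, pnse):
    """Schedule nodes 0..n-1 of a dependence DAG, greedily exploiting PNSEs.

    deps: list of (u, v) dependence edges meaning u must run before v.
    pnse: set of frozensets {u, v} - pairs that incur no page switch
          when scheduled back to back.
    """
    indeg = [0] * n
    succ = [[] for _ in range(n)]
    for u, v in deps:
        succ[u].append(v)
        indeg[v] += 1
    ready = {i for i in range(n) if indeg[i] == 0}
    order, last = [], None
    while ready:
        # 1) Prefer a node forming a PNSE with the previously scheduled node.
        cand = [v for v in ready if last is not None and frozenset((last, v)) in pnse]
        if not cand:
            # 2) Otherwise prefer PNSE endpoints ("pick PNSE sources first").
            cand = [v for v in ready if any(v in e for e in pnse)] or list(ready)
        v = min(cand)  # deterministic tie-break for illustration
        ready.remove(v)
        order.append(v)
        last = v
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                ready.add(w)
    return order

# Diamond DAG 0 -> {1, 2} -> 3, with a PNSE between 1 and 2:
print(greedy_schedule(4, [(0, 1), (0, 2), (1, 3), (2, 3)], {frozenset((1, 2))}))
# -> [0, 1, 2, 3]: 1 and 2 land adjacent, so their PNSE is exploited
```

The heuristic is linear in edges per step and avoids the exponential cost of the exact shortest-Hamiltonian-path formulation.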

Experimental Results
- 23% reduction in TLB switches from instruction scheduling alone

Outline
- Motivation for TLB power reduction
  - Use-last TLB architecture
  - Intuition for compiler techniques for TLB power reduction
- Compiler techniques
  - Instruction scheduling
  - Array interleaving
  - Loop unrolling
- Comprehensive solution
- Summary

Array Interleaving
- Arrays are interleaving candidates if
  - array size > page size
  - the arrays are accessed successively before interleaving, and
  - the elements would be accessed successively after interleaving
- Interleaving further requires that
  - the arrays have the same access function
  - the arrays are the same size (padding unequal arrays adds memory usage and addressing overheads)
- Multi-array interleaving
  - If arrays A and B are interleaving candidates for loop 1, and B and C for loop 2, then arrays A, B and C are interleaved together
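The effect of interleaving on page switches can be simulated. This is a hypothetical sketch, assuming two page-aligned arrays of equal size traversed together (as in C[i] = A[i] + B[i]), with page ids computed as element_index // PAGE_ELEMS.

```python
PAGE_ELEMS = 1024  # elements per page (assumed)
N = 4096           # elements per array; N > PAGE_ELEMS, so each array spans pages

def switches(pages):
    """Count consecutive accesses that land on different pages."""
    return sum(1 for p, q in zip(pages, pages[1:]) if p != q)

# Before: A occupies pages 0..3, B occupies pages 4..7; each iteration
# touches A[i] then B[i], alternating between two far-apart pages.
before = []
for i in range(N):
    before += [i // PAGE_ELEMS, N // PAGE_ELEMS + i // PAGE_ELEMS]

# After: AB[2*i] = A[i], AB[2*i+1] = B[i]; the pair shares a page almost always.
after = []
for i in range(N):
    after += [2 * i // PAGE_ELEMS, (2 * i + 1) // PAGE_ELEMS]

print(switches(before), switches(after))  # 8191 7
```

Before interleaving every access switches pages; after interleaving, switches happen only at page boundaries of the merged array.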

Experimental Results
- 35% reduction in TLB switches from array interleaving

Effect of Loop Unrolling
- Loop unrolling can only improve the effectiveness of page-switch reduction
- The loop is unrolled if there exists an instruction in the loop such that two copies of that instruction from successive iterations, scheduled together, reduce page switches
- Unrolling further reduces TLB switches
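The benefit can be illustrated with a page-sequence sketch (pages and the access pattern below are illustrative assumptions): unrolling by two lets copies of the same access from successive iterations be scheduled back to back on the same page.

```python
def switches(seq):
    """Count consecutive accesses on different pages."""
    return sum(1 for p, q in zip(seq, seq[1:]) if p != q)

def pages_rolled(n, page_a, page_b):
    """Each iteration touches A[i] (page_a) then B[i] (page_b)."""
    seq = []
    for _ in range(n):
        seq += [page_a, page_b]
    return seq

def pages_unrolled(n, page_a, page_b):
    """Unrolled by 2 and rescheduled: A[i], A[i+1], then B[i], B[i+1]."""
    seq = []
    for _ in range(0, n, 2):
        seq += [page_a, page_a, page_b, page_b]
    return seq

print(switches(pages_rolled(8, 0, 9)))    # 15: every access switches pages
print(switches(pages_unrolled(8, 0, 9)))  # 7: pairs share a page
```

The access count is unchanged; only the order changes, roughly halving the switches in this pattern.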

Outline
- Motivation for TLB power reduction
  - Use-last TLB architecture
  - Intuition for compiler techniques for TLB power reduction
- Compiler techniques
  - Instruction scheduling
  - Array interleaving
  - Loop unrolling
- Comprehensive solution
- Summary

Comprehensive Technique
- Fundamental transformations for page-switch reduction:
  - Instruction scheduling
  - Array interleaving
- Enhancement transformation:
  - Loop unrolling, applied after all re-scheduling options are exploited
- Result: 61% reduction in page switches for a 6.4% performance loss

Summary
- The TLB can consume significant power and also has a high power density
  - Important to reduce TLB power
- Use-last TLB architecture
  - An access to the same page does not cause a TLB switch
  - Effective for the I-TLB, but compiler techniques are needed to improve data page locality for the D-TLB
- Presented compiler techniques for TLB power reduction
  - Instruction scheduling
  - Array interleaving
  - Loop unrolling
- Reduce TLB power by 61% at a 6% performance loss
- A very effective hardware-software cooperative technique