Reiley Jeyapaul and Aviral Shrivastava Compiler-Microarchitecture Lab


B2P2: Bounds Based Procedure Placement for Instruction TLB Power Reduction in Embedded Systems
Reiley Jeyapaul and Aviral Shrivastava
Compiler-Microarchitecture Lab, Arizona State University, Arizona, USA
5/6/2019 http://www.public.asu.edu/~ashriva6/cml/

Application's View of Memory
[Figure: CPU issuing Load 0xf0 and Store 0x10 directly into a flat memory image (addresses 0 to N-1), as in a MIPS process]
- Applications assume the whole memory is available to them
- Works best if only one application is running, e.g., a deeply embedded system

Virtual Memory: Applications Independent of Memory
[Figure: CPU issues virtual addresses (Load 0xf0, Store 0x10); a page table maps them to physical memory (0 to N-1) or to disk]
- The application becomes independent of physical memory
- Memory can be added or removed
- Each application can be compiled independently of the others
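The translation sketched on this slide can be written out in a few lines of Python. This is a minimal sketch: the page size and the page-table contents are hypothetical, chosen to echo the Load 0xf0 / Store 0x10 example.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages

# Hypothetical page table: virtual page number (VPN) -> physical page number (PPN)
page_table = {0: 9, 1: 4, 3: 6}

def translate(vaddr):
    """Split a virtual address into (VPN, offset); a VPN missing from the
    table is a page fault, which the OS would service from disk."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        raise LookupError(f"page fault: VPN {vpn} not resident")
    return page_table[vpn] * PAGE_SIZE + offset
```

For example, `translate(0xf0)` maps the slide's Load address into physical page 9 at the same offset, while a virtual address in page 2 raises the page-fault error.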

Virtual Memory: Protection and Access Rights
[Figure: per-process page tables mapping virtual pages (VP 0, VP 1, VP 2) of processes i and j to physical pages (PP 9, PP 4, PP 6), each entry carrying Read?/Write? permission bits]
- Page tables contain access-right bits
- Hardware enforces this protection, trapping into the OS if a violation occurs

VM and Cache: Physically Addressed Cache
[Figure: CPU issues VA; translation produces PA; PA accesses the cache, with main memory on a miss and data returned on a hit]
- Physically addressed cache: accessed by physical addresses
- Allows multiple processes to have blocks in the cache at the same time
- Allows multiple processes to share pages
- Address translation must be performed before the cache lookup

Translation Cache: TLB
[Figure: CPU issues VA; TLB lookup produces PA for the cache and main memory; a TLB miss falls back to full translation]
- Translation Look-aside Buffer (TLB): a small, usually fully associative cache
- Maps virtual page numbers to physical page numbers
- Holds complete page-table entries for a small number of pages
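The TLB described here behaves like any small cache in front of the page table. A minimal sketch, with assumed entry count and an assumed LRU replacement policy:

```python
from collections import OrderedDict

class TLB:
    """Small fully associative translation cache with LRU replacement
    (the entry count and the policy are illustrative assumptions)."""
    def __init__(self, entries=32):
        self.entries = entries
        self.map = OrderedDict()  # VPN -> PPN, least recently used first

    def lookup(self, vpn, page_table):
        if vpn in self.map:               # TLB hit
            self.map.move_to_end(vpn)     # refresh LRU position
            return self.map[vpn], True
        ppn = page_table[vpn]             # TLB miss: walk the page table
        if len(self.map) >= self.entries:
            self.map.popitem(last=False)  # evict the least recently used entry
        self.map[vpn] = ppn
        return ppn, False
```

Usage: `lookup` returns the physical page number plus a hit flag; the first access to a page misses and fills the TLB, and a later access to the same page hits without touching the page table.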

Virtually-Indexed, Physically-Tagged Cache
[Figure: the cache index is taken from the virtual address while the TLB translates; the physical tag returned by the TLB is compared against the cache tag]
- Cache index is determined from the virtual address, so the cache and TLB lookups can begin at the same time
- Cache is physically tagged: the tag holds physical-address bits
- The TLB result is compared with the tag; only on a match is the access considered a hit
- Most embedded processors use VIPT caches, e.g., ARM processors and the Hitachi SH-3
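The reason the two lookups can start together is that the index bits come from address bits the translation does not change. A sketch under assumed sizes (page, line, and set counts chosen so the index fits inside the page offset; none of these are the XScale's actual parameters):

```python
PAGE_SIZE = 4096                    # assumed page size
LINE_SIZE = 32                      # assumed cache line size
NUM_SETS = PAGE_SIZE // LINE_SIZE   # index bits fit entirely inside the page offset

def cache_index(addr):
    """Set index taken from the low-order (untranslated) address bits."""
    return (addr // LINE_SIZE) % NUM_SETS

# Translation changes only the page number, not the offset, so the index
# computed from the virtual address equals the one computed from the
# physical address: the set can be read while the TLB translates, and
# only the tag compare needs the physical page number.
vpn_to_ppn = {3: 11}                # hypothetical translation
vaddr = 3 * PAGE_SIZE + 0x123
paddr = vpn_to_ppn[3] * PAGE_SIZE + 0x123
assert cache_index(vaddr) == cache_index(paddr)
```

If the index extended above the page offset (more sets or larger lines), virtual and physical indices could differ, which is why VIPT designs constrain the cache geometry.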

TLB Power and Power Density
- In the Hitachi SH-3 and Intel StrongARM, the instruction and data TLBs together can consume over 15% of the overall on-chip power budget
- Intel XScale: data cache is 32 KB; data TLB is 32 entries, fully associative; both are accessed the same number of times
- Power density of the TLB is more than 10x that of the caches, making it an important hotspot
- Support for small pages leads to a high miss rate
- Reducing TLB power is therefore important in embedded systems

First Came Hardware Approaches
- Banked-promotion TLB [Manne, 1997]: supports multiple page sizes, but can use only half the TLB
- Multiple-bank TLB [Lee, ISLPED'03]
- Two-level TLB [Hyuck, Comp. Arch. Letters, '03]
- Use-last TLB [Clark, ISLPED'03]

Hybrid Approaches
- Translation Registers (TRs) [Kandemir, '04, '05]: TRs hold some of the commonly needed translations, and instructions are inserted to use them instead of a TLB lookup
- 3.5% performance overhead; 32.6% reduction in TLB lookups
- We propose a hybrid approach over the use-last TLB architecture

Compiler Can Further Reduce iTLB Power
[Figure: use-last TLB; the last match is stored and reused, and a full lookup is triggered only if the page numbers differ]
- Use-last TLB architecture: a lookup happens only when the page changes
- Achieves 75% power savings in the iTLB
- Implemented in the Intel XScale processor
- The compiler can further reduce iTLB power
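The use-last mechanism can be sketched as a one-entry filter in front of the TLB (class and field names are ours, for illustration):

```python
class UseLastFilter:
    """Sketch of the use-last idea: a full (power-hungry) TLB lookup fires
    only when the current virtual page number differs from the last one seen."""
    def __init__(self):
        self.last_vpn = None
        self.full_lookups = 0

    def access(self, vpn):
        if vpn != self.last_vpn:   # page switch -> real TLB lookup
            self.full_lookups += 1
            self.last_vpn = vpn

# Straight-line code stays on one page for long stretches, so most
# instruction fetches are filtered out.
f = UseLastFilter()
for vpn in [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]:
    f.access(vpn)
```

In this hypothetical fetch stream of 10 accesses, only 3 full lookups fire (one per page switch), which is exactly why the compiler's job reduces to minimizing page switches.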

Page Switches in Instruction Memory
[Figure: Function_1 (containing Loop1 and a call to Function_2) and Function_2 (containing Loop2) laid out across PAGE 1, PAGE 2, and PAGE 3]
Three kinds of page switches:
- PSF: function-call page switches
- PSL: loop-execution page switches
- PSS: sequential-execution page switches

Code Placement Affects Page Switches
[Figure: Function_1 (Loop1, trip count 10, calling Function_2), Function_2 (Loop2, trip count 10, calling Function_3), and Function_3 (Loop3, trip count 10) laid out across PAGE 1, PAGE 2, and PAGE 3]

Code Placement Affects Page Switches
[Figure: the same three functions placed naively across the pages; each loop or call site that straddles a page boundary is charged per execution, e.g., PSF of 1, PSL of 10, and loop crossings of 10x1 = 10 and 10x10 = 100]
Total page switches = 10 + (100 + 100 + 10) + 10 + 1 = 231
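The totals on these placement slides follow from charging each boundary-crossing control-flow edge by its execution count. A minimal sketch (the edge list below is hypothetical, not the slide's exact layout):

```python
def weighted_page_switches(edges):
    """edges: iterable of (src_page, dst_page, exec_count) control-flow
    edges; a page switch is charged only when the two pages differ."""
    return sum(count for src, dst, count in edges if src != dst)

# Hypothetical edges: a loop back-edge straddling a page boundary pays on
# every iteration, while a same-page edge costs nothing.
edges = [
    (1, 2, 10),   # loop back-edge crossing pages 1 -> 2: 10 switches
    (2, 1, 10),   # the return crossing 2 -> 1: 10 more
    (1, 1, 100),  # hot same-page edge: free
    (1, 2, 1),    # one-time call into page 2: 1 switch
]
```

This is why placement matters: moving a hot loop so that both ends of its back-edge land on the same page deletes its entire per-iteration contribution.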

Code Placement Affects Page Switches
[Figure: inserting a nop() pushes Function_2 onto a page boundary and changes which crossings are executed (1 and 10 instead of 10 and 100); Loop3 still contributes 10x10 = 100]
Total page switches = 1 + 10 + 100 + 10 + 100 + 1 = 222

Procedure Placement Problem (PPP)
[Figure: with nop() padding and reordering, each function's loops and call sites fit within a single page]
- Inputs: function sizes, loop offsets, function-call offsets, loop sizes, loop counts
- Output: a start address for each function
- Objective: minimize page switches
- The problem is NP-complete: GPP can be reduced to PPP
- Total page switches for this placement = 1 + 10 + 10 = 21
- B2P2 is our heuristic for this problem

B2P2 Heuristic Demonstration
[Figure: functions F1 (size 80, loop L1 of size 50), F2 (size 70, loop L2 of size 50, call site of F3), and F3 (size 100, loop L3 of size 80) placed into pages with boundaries at 100, 200, 300, with a nop() inserted before F2]
Elements_List: <Id, Count, [offsets from Fn.SA]>
- L1, 10, [20-70]; L2, 10, [10-60]; CS.F3(), 10, [40]; L3, 10, [10-90]
Page-wise Bounds_List (a bound B is formed per element to remove its page switches, if any):
- Placing loop L1 within Page 1: 0 <= F1.SA+20 < F1.SA+70 <= 100
- Placing loop L2 within Page 2: 100 <= F2.SA+10 < F2.SA+60 <= 200 (sizeof(F1) + sizeof(F2) > PageSize, therefore the next element goes in a new page)
- Placing the start of F3 within Page 2: 100 <= F2.SA < F2.SA+70 <= 200 && 100 <= F3.SA <= 200
- Placing the end of F3 within Page 2: 100 <= F2.SA < F2.SA+70 <= 200 && 100 <= F3.SA+100 <= 200 -- conflicts with the existing bounds, so the bound is dropped
- Placing loop L3 within Page 3: 200 <= F3.SA+10 < F3.SA+90 <= 300

B2P2 Heuristic Flow-Chart
Input: program + profile information.
- Generate an edge-annotated DCFG
- Build <Functions_List> F (functions in the program) and <Elements_List> E (call sites and loops), sorted by decreasing weight (call count, loop count)
- Maintain <Page_List> P (pages occupied by the program) and <Page_Bounds> PB (for each page PBi, the set of bounds placing functions into that page)
- Take element E from the top of the list and form bound B = [P.SA, Element, P.EA] to remove its page switches
- If B does not conflict with the existing PBi.bounds, add it to the page-bounds list PBi
- If B conflicts and bounds already exist for the function Fx associated with E, disregard B; otherwise add B to a new page's bounds list PBi+1
- Continue with the next element E
More details in the paper.
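At the core of every bound is a containment test: an element contributes no page switches only if its span stays inside one page. A minimal sketch of that check (the function name is ours; the page size of 100 matches the demonstration slide, not a real system):

```python
def fits_in_page(start, size, page_size=100):
    """True iff the element occupying [start, start + size) does not cross
    a page boundary, i.e., both its first and last byte share a page."""
    return start // page_size == (start + size - 1) // page_size
```

For the demonstration's loop L1 (offsets 20 to 70 with F1 placed at address 0), the span fits in Page 1; shifted to start at 70, the same 50-byte loop would straddle the boundary at 100 and pay on every iteration.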

Page-Switch Reduction by B2P2
[Figure: per-benchmark page-switch reduction results]
- Large nested loop structures restrict the optimization
- Large functions with high call counts are analyzed well by B2P2

Summary
- The TLB consumes significant power and is an important hotspot, so TLB power must be reduced
- The use-last architecture, which looks up only on a page switch, is effective at reducing TLB power: 75% on dhrystone
- TLB power can be further reduced by code placement, but the optimization is NP-complete
- We proposed B2P2, a greedy heuristic that adjusts function placement to minimize page switches in loops and function calls
- 76% reduction in page switches, over and above that achieved by the use-last architecture
- Less than 2% performance impact; less than 5% increase in code size

Future Work
- We perform code relocation at function-level granularity; procedure placement at basic-block granularity could be implemented and its performance trade-off analyzed. The main challenge is not to impact performance.
- We implemented a greedy knapsack heuristic; formulating the PPP as a network-flow problem may be promising.
- Techniques for data TLB power reduction: use-last by itself is not very effective, and data placement can affect cache behavior, so it may not be effective with low-associativity caches.