Reiley Jeyapaul and Aviral Shrivastava Compiler-Microarchitecture Lab


1 B2P2: Bounds Based Procedure Placement for Instruction TLB Power Reduction in Embedded Systems
Reiley Jeyapaul and Aviral Shrivastava, Compiler-Microarchitecture Lab, Arizona State University, Arizona, USA

2 Application’s view of the Memory
Diagram: the CPU issues Store 0x10 and Load 0xf0 against a memory of addresses 0 to N-1 (the memory image for a MIPS process).
Applications assume the whole memory is available.
This works best when only one application is running, as in a deeply embedded system.

3 Virtual Memory: Applications Independent of Memory
Diagram: the CPU issues virtual addresses (Load 0xf0, Store 0x10); a page table maps them to physical addresses 0 to P-1 in memory, with further pages held on disk.
The application becomes independent of physical memory:
Memory can be added or removed.
Each application can be compiled independently of the others.
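The translation step above can be sketched in a few lines; the 4 KB page size and the page-table contents here are illustrative assumptions, not values from the slides.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages

# Hypothetical page table: virtual page number (VPN) -> physical page number (PPN)
page_table = {0x0: 0x9, 0x1: 0x4}

def translate(va):
    """Split a virtual address into page number and offset, then map the
    page number through the page table (a missing entry would be a page
    fault, which this sketch does not model)."""
    vpn, offset = divmod(va, PAGE_SIZE)
    ppn = page_table[vpn]
    return ppn * PAGE_SIZE + offset
```

For example, translate(0x10) returns 0x9010: VPN 0 maps to PPN 9, and the 12-bit page offset passes through unchanged.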

4 Virtual Memory: Protection and Access Rights
Diagram: per-process page tables for Process i and Process j map virtual pages (VP 0, VP 1, VP 2) to physical pages (PP 9, PP 4, PP 6), each entry carrying Read?/Write? access bits; an invalid entry is marked XXXXXXX.
Page tables contain access-right bits.
Hardware enforces this protection (a violation traps into the OS).

5 VM and Cache – Physical Cache
Diagram: the CPU's virtual address (VA) is translated to a physical address (PA) before the cache is looked up; a cache miss goes to main memory, a hit returns data.
Physically addressed cache: accessed by physical addresses.
Allows multiple processes to have blocks in the cache at the same time, and to share pages.
Address translation is performed before the cache lookup.

6 Translation Cache: TLB
Diagram: the CPU's VA goes through a TLB lookup; on a TLB hit the cache is accessed with the PA directly, and on a TLB miss the full translation is performed first.
Translation Look-aside Buffer (TLB): a small, usually fully associative cache.
Maps virtual page numbers to physical page numbers.
Contains complete page-table entries for a small number of pages.
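A minimal model of such a TLB, assuming LRU replacement and a dictionary page table (both illustrative choices, not a specific hardware design):

```python
from collections import OrderedDict

class TLB:
    """Sketch of a small, fully associative TLB with LRU replacement."""
    def __init__(self, entries=32):
        self.entries = entries
        self.map = OrderedDict()          # VPN -> PPN, oldest entry first

    def lookup(self, vpn, page_table):
        """Return (PPN, hit?). On a miss, walk the page table and fill."""
        if vpn in self.map:
            self.map.move_to_end(vpn)     # refresh LRU position
            return self.map[vpn], True
        ppn = page_table[vpn]             # TLB miss: page-table walk
        if len(self.map) >= self.entries:
            self.map.popitem(last=False)  # evict least recently used
        self.map[vpn] = ppn
        return ppn, False
```

A repeated lookup of the same page hits; touching more distinct pages than there are entries evicts the least recently used translation.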

7 Virtually-Indexed Cache
Diagram: the cache index is taken from the virtual address while the TLB translates in parallel; the physical tag returned by the TLB is compared (=) against the cache tag to decide a hit.
Cache index determined from the virtual address: the cache and TLB lookups can begin at the same time.
Cache physically tagged: the cache tag holds physical-address bits, compared with the TLB result; only on a match is the access a hit.
Most embedded processors use VIPT caches (e.g., ARM processors, Hitachi SH-3).
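The parallel lookup works because the set index fits inside the page offset. A small sketch makes this concrete; all sizes here are assumptions chosen so that one cache way equals one page.

```python
PAGE_SIZE = 4096   # assumed: 4 KB pages -> 12 page-offset bits
LINE_SIZE = 32     # assumed cache-line size
NUM_SETS  = 128    # NUM_SETS * LINE_SIZE == PAGE_SIZE (one way == one page)

def cache_index(addr):
    """Set index from an address; it uses only page-offset bits."""
    return (addr // LINE_SIZE) % NUM_SETS

def phys_tag(pa):
    """Physical tag: the bits above the index and line offset."""
    return pa // (LINE_SIZE * NUM_SETS)

# Virtual and physical addresses share the page offset, so indexing with
# the VA selects the same set as indexing with the PA would:
va, pa = 0x12345, 0x9A345        # same low 12 bits, different page number
assert cache_index(va) == cache_index(pa)
```

If the way size exceeded the page size, the index would include translated bits and this equality could fail, which is why the sizes above are chosen to keep the index within the page offset.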

8 TLB Power and Power-Density
In the Hitachi SH-3 and Intel StrongARM, the instruction and data TLBs together can consume over 15% of the overall on-chip power budget.
Intel XScale: the data cache is 32 KB, while the data TLB has only 32 entries, fully associative; both are accessed the same number of times.
The power density of the TLB is more than 10X that of the caches, making it an important hotspot.
Support for small pages leads to a high miss rate.
It is therefore important to reduce TLB power in embedded systems.

9 First came Hardware Approaches
Banked-promotion TLB [Manne, 1997]: supports multiple page sizes, but can use only half the TLB.
Multiple-bank TLB [Lee, ISLPED'03].
Two-level TLB [Hyuck, Comp. Arch. Letters, '03].
Use-last TLB [Clark, ISLPED'03].

10 Hybrid Approaches
Translation Registers (TR) [Kandemir, '04, '05]:
TRs can hold some of the commonly needed translations; instructions are inserted to use them instead of a TLB lookup.
3.5% performance overhead; 32.6% reduction in TLB lookups.
We propose a hybrid approach over the use-last TLB architecture.

11 Compiler can further reduce iTLB Power
Diagram: the incoming page number is compared against the LAST MATCH register; a full lookup is triggered only if the page numbers differ.
Use-last TLB architecture: a lookup happens only when the page changes.
Achieves 75% power savings in the iTLB; implemented in the Intel XScale processor.
The compiler can further reduce iTLB power.
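The use-last behavior can be sketched as a counter over a fetch trace; the page size and the trace are illustrative, and this models the policy rather than the circuit.

```python
PAGE_SIZE = 4096  # assumed page size

def use_last_lookups(fetch_addrs):
    """Count TLB lookups under a use-last policy: the last translation is
    reused while consecutive fetches stay within the same page."""
    lookups, last_vpn = 0, None
    for addr in fetch_addrs:
        vpn = addr // PAGE_SIZE
        if vpn != last_vpn:          # page switch: a full lookup is needed
            lookups += 1
            last_vpn = vpn
    return lookups

# Fetching straight through three pages, 4 bytes per instruction:
trace = range(0, 3 * PAGE_SIZE, 4)
print(use_last_lookups(trace))       # 3 lookups instead of 3072
```

Ping-ponging between two pages, by contrast, pays one lookup per fetch, which is why code placement matters for this architecture.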

12 Page Switches in Instruction Memory
Diagram: code laid out across PAGE 1, PAGE 2, and PAGE 3, with Function_1() containing Loop1 and a call to Function_2(), and Function_2() containing Loop2.
Three kinds of page switches in instruction memory:
PSF: function-call page switches.
PSL: loop-execution page switches.
PSS: sequential-execution page switches.

13 Code Placement Affects PSs
Example layout across PAGE 1, PAGE 2, and PAGE 3: Function_1() contains Loop1 (count 10) and a call to Function_2(); Function_2() contains Loop2 (count 10) and a call to Function_3(); Function_3() contains Loop3 (count 10).

14 Code Placement Affects PSs
With this placement, page boundaries fall inside the hot code, producing switches including PSF = 1 for the call from Function_1(), PSL = 10 for Loop1, 10x1 = 10 for the call inside Loop1, and 10x10 = 100 each inside Loop2 and Loop3.
Total page switches = 231.
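The effect of placement on switch counts can be reproduced with a toy trace model; the page numbers and trip counts below are illustrative, and the slide's totals come from its specific call and loop structure.

```python
def count_page_switches(page_trace):
    """Number of transitions between different pages in an executed
    trace of page numbers (toy model of the slide's counting)."""
    return sum(1 for a, b in zip(page_trace, page_trace[1:]) if a != b)

# A loop body split across pages 1 and 2, executed 10 times, switches
# pages on nearly every step; the same loop packed into page 1 does not:
split_loop  = [1, 2] * 10
packed_loop = [1] * 20
print(count_page_switches(split_loop))   # 19
print(count_page_switches(packed_loop))  # 0
```

The asymmetry is the whole motivation for placement: a single boundary inside a hot loop multiplies its cost by the trip count.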

15 Code Placement Affects PSs
Inserting a nop() before Function_2() aligns it to the page boundary, removing the 10 switches from the call inside Loop1 at the cost of 1 extra switch; Loop3 still incurs 10x10 = 100.
Total page switches = 222.

16 Procedure Placement Problem
With the best placement (nop() padding aligns each function so its loop lies within a single page), total page switches = 21.
Procedure Placement Problem (PPP):
Inputs: function size, loop offset, function-call offset, loop size, loop count.
Output: the start address of each function.
Optimization: minimize page switches.
The problem is NP-complete: GPP can be reduced to PPP.
B2P2: our heuristic.

17 B2P2 Heuristic Demonstration
Function sizes: F1() = 80, F2() = 70, F3() = 100. Loop sizes: L1 = 50, L2 = 50, L3 = 80. Page size = 100.
Elements list, as <Id, count, [offsets from Fn.SA]>: L1, 10, [20-70]; L2, 10, [10-60]; CS.F3(), 10, [40]; L3, 10, [10-90].
Page-wise bounds are formed for each element to remove its page switches (if any):
Placing Loop L1 within Page 1: 0 ≤ F1.SA+20 < F1.SA+70 ≤ 100.
Sizeof(F1) + Sizeof(F2) > PageSize, so the next element goes into a new page.
Placing Loop L2 within Page 2: 100 ≤ F2.SA+10 < F2.SA+60 ≤ 200.
Placing the start of F3 within Page 2: 100 ≤ F2.SA < F2.SA+70 ≤ 200 and 100 ≤ F3.SA ≤ 200.
Placing the end of F3 within Page 2 conflicts with the existing bounds, so that bound is dropped.
Placing Loop L3 within Page 3: 200 ≤ F3.SA+10 < F3.SA+90 ≤ 300.
The resulting layout inserts a nop() before F2() so that all accepted bounds are satisfied.
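The bound-formation step can be sketched as interval arithmetic on a function's start address; the helper names here are illustrative, not the paper's notation.

```python
PAGE_SIZE = 100   # matches the example's page size

def bound_for(elem_lo, elem_hi, page_lo):
    """Interval of start addresses SA such that an element at offsets
    [elem_lo, elem_hi] from SA lies entirely within the page starting
    at page_lo, i.e. page_lo <= SA + elem_lo and
    SA + elem_hi <= page_lo + PAGE_SIZE."""
    return (page_lo - elem_lo, page_lo + PAGE_SIZE - elem_hi)

def conflict(b1, b2):
    """Two bounds on the same start address conflict when their
    admissible intervals do not intersect."""
    return max(b1[0], b2[0]) > min(b1[1], b2[1])

# Loop L1 at offsets [20, 70] from F1.SA, placed within page [0, 100]:
print(bound_for(20, 70, 0))    # F1.SA must lie in [-20, 30]
```

A new bound that intersects every accepted bound for the same function simply tightens the admissible interval; one that does not is the "conflict" case in the example.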

18 B2P2 Heuristic Flow-Chart
Inputs: the program plus profile information.
Generate an edge-annotated DCFG, and from it build: Functions_List F (the functions in the program), Elements_List E (call sites and loops, sorted by decreasing weight: call count, loop count), Page_List P (the pages occupied by the program), and Page_Bounds PB (the set of bounds placing functions into each page).
Take the element E at the top of the list and form bound B = [P.SA, Element, P.EA] to remove that element's page switches.
If B does not conflict with the existing bounds PBi.bounds of some page PBi, add it to that page-bounds list.
If it conflicts: when a bound already exists in PB for the function Fx associated with E, disregard B; otherwise add B to a new page-bounds list PBi+1.
Get the next element E and continue.
More details in the paper.
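The flow above can be approximated by a small greedy loop. This is an illustrative simplification: conflicting bounds are simply dropped rather than spilled to a new page, and the data layout is invented for the sketch.

```python
def b2p2_greedy(elements):
    """Greedy sketch of the B2P2 flow.  elements: list of
    (weight, fn, (lo, hi)) where (lo, hi) is the interval of admissible
    start addresses implied by one element's bound.  Elements are
    visited in decreasing weight; a bound is kept only if it still
    intersects the bounds already accepted for the same function."""
    accepted = {}                         # fn -> current (lo, hi)
    for weight, fn, (lo, hi) in sorted(elements, reverse=True):
        cur_lo, cur_hi = accepted.get(fn, (float("-inf"), float("inf")))
        new_lo, new_hi = max(cur_lo, lo), min(cur_hi, hi)
        if new_lo <= new_hi:              # consistent: tighten the bound
            accepted[fn] = (new_lo, new_hi)
        # otherwise the conflicting, lower-weight bound is dropped
    return accepted
```

Visiting elements by decreasing weight means the hottest loops and call sites claim their placement constraints first, so any dropped bound belongs to colder code.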

19 Page-Switch Reduction by B2P2
Large nested loop structures restrict the optimization.
Large functions with high call counts are correctly analyzed by B2P2.

20 Summary
The TLB consumes significant power and is an important hotspot, so TLB power must be reduced.
The use-last architecture is effective: 75% power reduction on dhrystone, with a lookup only on a page switch.
TLB power can be further reduced by code placement; this optimization is NP-complete.
We proposed B2P2, a greedy heuristic that adjusts function placement to minimize page switches in loops and function calls.
76% reduction in page switches, over and above that achieved by the use-last architecture.
Less than 2% performance impact; less than 5% increase in code size.

21 Future Work
We perform code relocation at function-level granularity; procedure placement at basic-block granularity could be implemented and its performance tradeoff analyzed, the main challenge being not to impact performance.
We implemented a greedy knapsack heuristic; formulating the PPP as a network-flow problem may be promising.
Techniques for data TLB power reduction: use-last by itself is not very effective there, and data placement can affect cache behavior, so it may not be effective with low-associativity caches.

