Framework for Profile-Analysis Data-Layout Optimizations. Shai Rubin, Ras Bodik, Trishul Chilimbi. Microsoft Research, University of Wisconsin.

Presentation transcript:

1 Framework for Profile-Analysis Data-Layout Optimizations. Shai Rubin, Ras Bodik, Trishul Chilimbi. Microsoft Research, University of Wisconsin.

2 Data Layout Optimization (What)
Memory hierarchy: CPU, cache (1 cycle), memory (10^2 cycles), disk (10^6 cycles).
Reference sequence: A.x, B, A.z
[Figure: placement of A.x, A.z, and B across cache blocks and memory pages 1 and 2 over time, for the original data layout and for the modified data layout produced by the DL optimization.]
DL optimization: increase spatial locality of data to prevent memory faults.
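
To make the spatial-locality idea on this slide concrete, here is a minimal sketch that counts how many cache blocks the reference sequence A.x, B, A.z touches under two layouts. The byte addresses and the 32-byte block size are assumptions chosen for illustration; they do not come from the slides.

```python
# Hypothetical block size and byte addresses, chosen only for illustration.
BLOCK_SIZE = 32

def blocks_touched(references, layout, block_size=BLOCK_SIZE):
    """Return the set of cache-block numbers touched by a reference sequence."""
    return {layout[ref] // block_size for ref in references}

references = ["A.x", "B", "A.z"]

# Original layout: the two fields of A are far apart and B lives elsewhere.
original_layout = {"A.x": 0, "A.z": 40, "B": 100}
# Modified layout: the three referenced items are packed next to each other.
modified_layout = {"A.x": 0, "B": 8, "A.z": 16}

original_blocks = blocks_touched(references, original_layout)
modified_blocks = blocks_touched(references, modified_layout)
print(len(original_blocks), len(modified_blocks))  # prints "3 1"
```

Fewer distinct blocks for the same references means fewer opportunities for a miss, which is the effect the optimization is after.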

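3 Data Layout Optimization (How)
[Diagram: Program -> reference summary -> data layout optimizer (chooses a data layout from the layout space) -> enforce layout -> Program′.]
- Scientific (array-based) programs: reference summary from array dependence analysis (static); optimizer is optimal for simple loops; layout enforced at compile time; result is an optimal layout.
- General-purpose (pointer-based) programs: reference summary from a reference trace (dynamic); optimizer is a heuristic; layout enforced at (1) compile time or (2) runtime; result is a "good" layout.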

4 Problems with Current Data-Layout Optimization
- Computationally hard to find the optimal layout [Petrank].
- Computationally hard to approximate the optimal layout [Petrank].
- Implication: heuristics are not robust; they will not work for all programs.
- From our experience with heuristics:
  - Field Reordering [Chilimbi PLDI '99]: no improvement (on perl).
  - Custom Memory Allocator [Seidl ASPLOS '98]: degrades performance (on espresso).
- Our approach: replace the heuristic with a feedback-driven search.

5 Searching For a Data Layout
Problem: perform a search in the data layout space. Look for:
- a "good" layout.
- an "easy" to enforce layout.
Search advantage:
- Robust: for each program, finds a "good" layout.
[Figure: the data layout space, marking the current program data layout, the optimal data layout, the "good" layouts, and the "good" + "easy" to enforce layouts.]

6 Is Search Practical?
Not clear.
[Diagram: a reference trace feeds the optimizer (heuristic), which picks a data layout from the possible layouts; the enforce-layout step is an Edit -> Compile -> Execute -> Evaluate -> Continue? loop that runs until End.]

7 Outline
- Background and Problem Definition
- Search is a solution, but may not be practical
  - Making the search practical
- Applications
- Summary

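8 Making the Search Practical: Framework for Data Layout Optimization
[Diagram: the data layout search engine, a loop over the following components, shown against the Edit, Compile, Execute, Evaluate, Continue?, End steps.]
- Compress(T) → CST: the reference trace T becomes a compressed symbolic trace.
- Data Object Analysis, DOA(CST, LS) → NLS: the layout space is reduced to a narrowed space.
- Layout Selector, LS(NLS, B, CST, SS) → DL: a data layout is chosen using the benefit B and the search strategy SS.
- Enforce Layout, AL(DL, CST) → NT: the data layout is applied to produce a new trace.
- Evaluate, Simulate(NT) → B: the new trace is simulated to obtain the benefit.
- Continue(B): decide whether to keep searching.
Layout space: class splitting, linearization, field reordering.
Output: "good" and enforceable layouts.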

9 Trace Representation
Problem: the reference trace cannot be easily manipulated because it is too large (>10GB, >100M references).
Solution: compressed trace (using modified SEQUITUR).
Example:
- Trace: acbcbcbcbdbdbdbde
- SEQUITUR representation:
    S → acBBBAAe
    B → bc
    A → CC
    C → bd
Representation advantage:
- Compact; fits into main memory [Chilimbi PLDI '01].
- Exposes repetitions (we use this later).
- Produces a symbolic trace (i.e., a terminal is a data object).
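
The grammar on this slide can be checked mechanically. The sketch below only expands the slide's rules back into the trace and compares sizes; it does not implement SEQUITUR compression itself, and the dictionary encoding of the grammar is an assumption made for illustration.

```python
# Grammar copied from the slide; uppercase symbols are rules, lowercase are data objects.
GRAMMAR = {"S": "acBBBAAe", "B": "bc", "A": "CC", "C": "bd"}

def expand(symbol, grammar=GRAMMAR):
    """Recursively expand a grammar symbol into the trace it represents."""
    if symbol not in grammar:
        return symbol  # terminal: a data object reference
    return "".join(expand(child, grammar) for child in grammar[symbol])

trace = expand("S")
grammar_size = sum(len(rhs) for rhs in GRAMMAR.values())
print(trace)                     # prints "acbcbcbcbdbdbdbde"
print(len(trace), grammar_size)  # prints "17 14"
```

On a trace this short the saving is small; the benefit comes from long traces with heavy repetition, where each repeated sub-trace is stored once as a rule.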

10 Framework for Data-Layout Optimization
[Diagram: the data layout search engine, with Compile and Continue?/End still shown beside the loop.]
- Compress(T) → CST: reference trace to compressed symbolic trace.
- Data Object Analysis, DOA(CST, LS) → NLS: layout space to narrowed space.
- Layout Selector, LS(NLS, B, CST, SS) → DL: data layout chosen from benefit and search strategy.
- Enforce Layout, EL(DL, CST) → CST′: new trace.
- Evaluate, Simulate(NT) → B: benefit.
- Continue(B).
Layout space: class splitting, linearization, field reordering.
Output: "good" and enforceable layouts.

11 Avoid re-compilation
Problem: data layout evaluation = (edit + compilation + simulation).
Solution: "pretend" that the program was edited and compiled.
Enforce Layout: symbolic trace + data layout → concrete address trace.
Example:
- Single symbolic trace: A.x, B, A.z, B
- Data layouts shown: (A.x → 10, A.z → 14, B → 20) and (A.x → 30, A.z → 34, B → 20)
- New concrete trace: 30, 20, 34, 20
[Diagram: the user (optimizer) path of Edit program, Compile, Run (simulate) is replaced by Enforce Layout followed by Simulate.]
Simple, but crucial for an efficient search.
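
The Enforce Layout step amounts to a lookup: each symbolic reference is replaced by the address its data object has under the candidate layout. A minimal sketch follows, using the addresses from the slide; the function name enforce_layout and the dictionary encoding are assumptions for illustration, not the authors' interface.

```python
def enforce_layout(symbolic_trace, layout):
    """Map each symbolic reference to its address under the given data layout."""
    return [layout[obj] for obj in symbolic_trace]

symbolic_trace = ["A.x", "B", "A.z", "B"]

# Two candidate layouts; no edit or recompile is needed to try either one.
layout_1 = {"A.x": 10, "A.z": 14, "B": 20}
layout_2 = {"A.x": 30, "A.z": 34, "B": 20}

print(enforce_layout(symbolic_trace, layout_1))  # prints "[10, 20, 14, 20]"
print(enforce_layout(symbolic_trace, layout_2))  # prints "[30, 20, 34, 20]"
```

Because the symbolic trace is collected once and reused, the cost of trying a new layout drops to one pass over the trace plus a simulation.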

12 Framework for Data-Layout Optimization
[Diagram: the data layout search engine, with Compile and Continue?/End still shown beside the loop.]
- Compress(T) → CST: reference trace to compressed symbolic trace.
- Data Object Analysis, DOA(CST, LS) → NLS: layout space to narrowed space.
- Layout Selector, LS(NLS, B, CST, SS) → DL: data layout chosen from benefit and search strategy.
- Enforce Layout, EL(DL, CST) → CST′: new trace.
- Evaluate, Simulate(CST′) → B: benefit.
- Continue(B).
Layout space: class splitting, linearization, field reordering.
Output: "good" and enforceable layouts.

13 Memoization: Efficient Trace Simulation
Evaluation using simulation: MissRate_T = Simulate(T).
Problem: simulating the whole trace T is too expensive.
Solution: avoid re-simulation of repeated sub-traces.
Example:
- T: bcbcbcbdbdbdbd
- SEQUITUR representation:
    S → BBBAA
    B → bc
    A → CC
    C → bd
Memoization:
1. Simulate each "low level" rule and compute its memoization value. For cache simulation, memoization value = CacheState [CS].
     CS_C = Simulate′(C)
     CS_B = Simulate′(B)
2. Recursively compose memoization values for "higher" rules (∘ denotes composition).
     CS_A = CS_C ∘ CS_C
     CS_S = CS_B ∘ CS_B ∘ CS_B ∘ CS_A ∘ CS_A
MissRate_T = [truncated in source]
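
One simple way to see why the grammar helps simulation is sketched below. It is a swapped-in variant, not the slide's cache-state composition: it memoizes on the pair (rule, cache state on entry), so a rule that is re-entered in a state already seen is not simulated again. The direct-mapped cache with two sets and the block numbers for b, c, and d are assumptions for illustration.

```python
from functools import lru_cache

# Grammar and trace from the slide; cache parameters below are hypothetical.
GRAMMAR = {"S": "BBBAA", "B": "bc", "A": "CC", "C": "bd"}
BLOCK_OF = {"b": 0, "c": 1, "d": 2}   # assumed cache-block number per data object
NUM_SETS = 2                          # assumed direct-mapped cache with two sets
EMPTY = (-1,) * NUM_SETS              # -1 marks an empty set

@lru_cache(maxsize=None)
def simulate(symbol, state):
    """Return (misses, exit_state) for a symbol entered with cache state `state`."""
    if symbol in BLOCK_OF:            # terminal: a single reference
        block = BLOCK_OF[symbol]
        index = block % NUM_SETS
        if state[index] == block:
            return 0, state
        return 1, state[:index] + (block,) + state[index + 1:]
    misses = 0
    for child in GRAMMAR[symbol]:     # rule: simulate children in order
        child_misses, state = simulate(child, state)
        misses += child_misses
    return misses, state

def simulate_flat(trace):
    """Reference simulation over the uncompressed trace, for comparison."""
    state, misses = list(EMPTY), 0
    for symbol in trace:
        block = BLOCK_OF[symbol]
        index = block % NUM_SETS
        if state[index] != block:
            state[index] = block
            misses += 1
    return misses

memo_misses, final_state = simulate("S", EMPTY)
flat_misses = simulate_flat("bcbcbcbdbdbdbd")
reused = simulate.cache_info().hits   # number of simulations skipped via the memo table
print(memo_misses, flat_misses, reused > 0)  # prints "9 9 True"
```

The memoized run gives the same miss count as the flat run while skipping repeated work; the saving grows with the amount of repetition the grammar exposes.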

14 Outline
- Background and Problem Definition
- Search is a solution, but maybe not feasible
  - Making the search practical: trace representation, avoid recompilation, efficient simulation
- Applications
- Summary

15 Framework Application (1)
Application: an implementation of the framework that searches in a sub-space of the layout space.
Field Reordering:
- Objective: reduce the number of cache misses.
- Sub-space: all possible (legal) orders of fields in (heap) objects.
- Our search strategy: (almost) exhaustive search.
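
The field-reordering search described above can be sketched as a loop over all field orders, scoring each by simulated misses. Everything concrete here is an assumption for illustration: the field names and sizes, the 64-byte block, the one-block cache used as the evaluator, and the absence of alignment rules.

```python
from itertools import permutations

BLOCK_SIZE = 64                              # assumed cache-block size in bytes
FIELD_SIZES = {"x": 8, "buf": 56, "z": 8}    # hypothetical fields of one object
TRACE = ["x", "z", "x", "z"]                 # hypothetical symbolic field trace

def offsets(order):
    """Assign byte offsets to fields laid out back to back in the given order."""
    result, offset = {}, 0
    for field in order:
        result[field] = offset
        offset += FIELD_SIZES[field]
    return result

def misses(order):
    """Count misses in a one-block cache when TRACE runs against this field order."""
    layout = offsets(order)
    cached_block, count = None, 0
    for field in TRACE:
        block = layout[field] // BLOCK_SIZE
        if block != cached_block:
            count += 1
            cached_block = block
    return count

original = ("x", "buf", "z")
best = min(permutations(FIELD_SIZES), key=misses)   # exhaustive search over orders
print(misses(original), best, misses(best))  # prints "4 ('x', 'z', 'buf') 1"
```

Exhaustive search is feasible here only because the object has three fields; the number of orders grows factorially, which is why the slide says "(almost) exhaustive".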

16 Field Reordering: Exhaustive Search
We compared:
- The best field order found by our iterative search.
- Field orders produced by existing heuristics:
  - Fields Temporal Affinity [Chilimbi PLDI '99]
  - Fields Access Frequency [Truong PACT '98]
Runtime improvement: 0%-4.5%.

17 Custom Memory Allocator (CMA)
Objective: reduce the number of page faults.
Reference trace: ABABA
[Figure: address-versus-time plots of the trace over Page 1 and Page 2 for two allocators. Allocator 1: poor locality. Allocator 2: good locality.]
CMA can work well if it has a good placement function, which assigns dynamically allocated heap objects to memory pages (heaps).

18 CMA Placement Function (PF)
PF: map objects to heaps, PF(heap object) → int, called from malloc(size s){ }.
How can we find a placement function using our framework?
- A placement function defines a data layout.
- Learn by measuring the benefits of its data layout.
- How: use a learning algorithm.
[Diagram: profiling information, Profile(heap objects) → runtime attributes, feeds a decision tree learner; the learner produces PF(attributes) → int; the framework is used to evaluate the PF. Example decision tree on size: branches size < 24 and size ≥ 24, with leaves 1 and 2.]
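
A placement function of the kind shown in the decision tree can be evaluated by laying out objects heap by heap and measuring the working set of a trace. The sketch below is illustrative only: the page size, the object sizes, the allocation order, the trace, and the assignment of the size < 24 branch to heap 1 are all assumptions, and each object is attributed to the page holding its first byte.

```python
PAGE_SIZE = 64  # hypothetical page size in bytes, kept tiny for readability

def placement_function(size):
    """Decision-tree PF from the slide's example: split on size 24 (heap ids assumed)."""
    return 1 if size < 24 else 2

def single_heap(size):
    """Baseline PF: every object goes to the same heap."""
    return 1

def working_set_pages(allocations, trace, pf):
    """Bump-allocate objects into per-heap regions; count distinct pages the trace touches."""
    next_free, page_of = {}, {}
    for name, size in allocations:
        heap = pf(size)
        offset = next_free.get(heap, 0)
        page_of[name] = (heap, offset // PAGE_SIZE)   # page of the object's first byte
        next_free[heap] = offset + size
    return len({page_of[name] for name in trace})

# Small hot objects interleaved with large cold ones (hypothetical workload).
allocations = [("h0", 16), ("c0", 48), ("h1", 16), ("c1", 48),
               ("h2", 16), ("c2", 48), ("h3", 16), ("c3", 48)]
trace = ["h0", "h1", "h2", "h3"] * 3

baseline = working_set_pages(allocations, trace, single_heap)
with_pf = working_set_pages(allocations, trace, placement_function)
print(baseline, with_pf)  # prints "4 1"
```

A learner would propose many such functions and keep the one whose measured benefit is best; here the measurement step alone is shown.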

19 CMA Results
Program       Number of heaps
Espresso      2
Boxsim        8
Twolf         5
Perl          5
Ghostscript   10
Lp_solve      6
Footnote 1: Relative to original working set size.

20 Contributions and Future Work
Contributions:
- Formulate data layout optimization as a search process.
- Build a framework for an efficient search process.
- Improve existing optimizations; enable new optimizations.
Framework limitations:
- Difficult to handle very large traces (>0.5B references).
- Requires some guidance from the programmer (search strategy).
Future work:
- Advanced search strategies that combine several optimizations.
- Other, non-data-layout optimizations: prefetching.