Faculty of Computer Science © 2008 José Nelson Amaral MPADS: Memory-Pooling-Assisted Data Splitting Stephen Curial - Xymbiant Systems Inc. Peng Zhao -


Faculty of Computer Science © 2008 José Nelson Amaral
MPADS: Memory-Pooling-Assisted Data Splitting
Stephen Curial - Xymbiant Systems Inc.
Peng Zhao - Intel Corporation
J. Nelson Amaral - University of Alberta
Yaoqing Gao, Shimin Cui, Raul Silvera, Roch Archambault - IBM Toronto Software Laboratory
FROM SUN MICROSYSTEMS

© 2006 Department of Computing Science ISMM 2008
Goal
• What:
  - Improve spatial locality
• Where:
  - Linked-based data structures
• How:
  - Pooling similar structures together
  - Grouping same fields from multiple objects together

Goal (cont.)
• Why:
  - Because we can
  - Allow easy-to-write, easy-to-read, easy-to-maintain code to improve performance
• What compiler:
  - IBM XL compiler suite
• Limitation:
  - Needs more precise pointer analysis to benefit from more opportunities

Most Relevant Earlier Work
• Pool Allocation
  - Lattner and Adve (CGO 04, PLDI 05)
• Reference Affinity
  - Zhong, Orlovich, Shen, Ding (PLDI 04)
  - Rabbah and Palem (TECS 03)
• Array Reshaping
  - Zhao, Cui, Gao, Silvera, Amaral (TOPLAS 07)

A refreshing outcome
“MPADS is not the first implementation of the combination of memory pools and splitting of pointer-based data structures.”
“MPADS is still not delivering its full potential on standard benchmarks in the IBM XL compiler.”
Reviewer’s Comment: “The technique only worked for Olden, and did nothing for SPECcpu2000 (but the authors get bonus points for being honest about that.)”

The Cost of Programming Productivity
• Easy-to-read and easy-to-maintain code often results in lower runtime performance.
[Figure: Student, Class, University]

The Cost of Programming Productivity
• Abstraction
• Inheritance
[Figure: Person, with Student, Professor, and Support Staff]

The Cost of Programming Productivity
• Data Encapsulation
Person: Date of Birth, Address, Driver Lic., Citizenship, Name, Gender
Student: Faculty, Date of Adm, Department, Program, Univ. ID, Classes Enr., Grades

A possible data layout
Student: Faculty, Date of Adm, Department, Program, Univ. ID, Classes Enr., Grades
  Sizes shown: 1 byte, 4 bytes, 1 byte, 2 bytes, 4 bytes
Person:
  Date of Birth: 4 bytes
  Address: 32 bytes
  Driver Lic.: 3 bytes
  Gender: 1 byte
  Name: 32 bytes
  Citizenship: 16 bytes

Data in Memory
[Figure: Student records laid out one after another along the memory address axis, each as Univ. ID, Date of Adm., Fa., De., Progr., Classes Enr., Grades]
[Figure: Person records laid out along the memory address axis, each as Name, Date of Birth, Address, Dr. Lic., Ge., Citizenship]

Assume a Cache Organization
• POWER5 Cache Organization
  - L1 Data Cache: 32 Kbytes, 128-byte cache lines
  - L2 Cache: 1.44 Mbytes, 128-byte cache lines
  - L3 Cache: 32 Mbytes, 512-byte cache lines

Cache Organization
[Figure: cache drawn as a grid; axes: Bytes, Cache Lines]

Example: A search through the data structures
How many Computing Science students are younger than 23 years old?
[Figure: Student records (Univ. ID, Adm., F., D., Prg., Class., Grades) mapped onto cache lines; axes: Bytes, Cache Lines]

Example: A search through the data structures
Student structure: for every 24 bytes loaded, the search reads either 1 or 5 bytes.
[Figure: Student records (Univ. ID, Adm., F., D., Prg., Class., Grades) mapped onto cache lines; axes: Bytes, Cache Lines]

Example: A search through the data structures
[Figure: Student records (Univ. ID, Adm., F., D., Prg., Class., Grades) and Person records (Name, DofB, G, Citizens., Address, DL) mapped onto cache lines; axes: Bytes, Cache Lines]

Example: A search through the data structures
Person structure: for every 88 bytes loaded, the search reads 4 bytes.
[Figure: Student records (Univ. ID, Adm., F., D., Prg., Class., Grades) and Person records (Name, DofB, G, Citizens., Address, DL) mapped onto cache lines; axes: Bytes, Cache Lines]

Data Reshaping for Arrays of Structures
Student *ListOfStudents;
….
ListOfStudents = (Student*)malloc(….);
[Figure, before: array of Student records, each as Univ. ID, Date of Adm., Fa., De., Progr., Classes Enr., Grades]
[Figure, after: array of records, each as Univ. ID, Date of Adm., Fa., De., Progr.]

Maximal Structure Splitting
Before: ID1 Adm1 Dep1 Fac1 Clas1 Grad1 | ID2 Adm2 Dep2 Fac2 Clas2 Grad2 | ID3 Adm3 Dep3 Fac3 Clas3 Grad3
After: ID1 ID2 ID3 | Adm1 Adm2 Adm3 | Fac1 Fac2 Fac3 | Dep1 Dep2 Dep3 | Clas1 Clas2 Clas3 | Grad1 Grad2 Grad3

Implementation of Pool Allocation
• Intercept mallocs and replace them by pool allocation: each structure layout gets its own pool.
• If a pool is full, another pool can be allocated.
[Figure: pools holding consecutive records 1, 2, 3, ..., each record as ID, Adm, Fac, Dep, Clas, Grad]

Implementing Pool Allocation
• The following types of statements need to be transformed:
  - Memory allocation statements
  - Memory reference statements

Transforming Memory Allocation Statements
• Extended pointer analysis to maintain a set of allocation sites associated with each alias set.
• When an alias set is selected for transformation:
  - Replace each associated allocation with a call to the pool allocation function.

Transforming Memory References
• Update address calculation for loads and stores:
  - Uniform splitting: all fields are the same size
      Address calculation is simpler
      Restricts application of technique, or requires memory padding
  - Non-uniform splitting: fields of different sizes
      Address calculation is more involved
      Can be applied more generally

Non-Uniform Example
struct example {
    type_3 a; /* 3 bytes */
    type_7 b; /* 7 bytes */
    type_5 c; /* 5 bytes */
};
How can the compiler find the address to access s->c?
pool_base = s & 0xF…F000
index = (s - pool_base) / 3
field_base = (3+7)*num_structs_per_pool
s->c = *(s + field_base - 3*index + 5*index)
s->c = *(s + field_base + (5-3)*index)
[Figure: a pool with pointer s into the region of a fields; pool_base and field_base marked]

Data Transformation Safety
• How does the compiler decide whether it is safe to transform a given structure?
  - Based on the results of the pointer analysis.

Is it safe to transform a given data structure?
Structure layout: two structures have the same layout if each field has the same offset and the same length.
• Build alias set
  - If a pointer P may point to the structure, then all the objects in the points-to set of the alias set of P must have the same layout.
[Figure: pointers P and Q form the alias set; Data Struct 1 and Data Struct 2 form the points-to set]

Experimental Results - Micro Benchmarks (Speedup)
[Figure: two panels, Power 4 and Power 5]

Experimental Results - Micro Benchmarks (Instruction Count)
[Figure: two panels, Power 4 and Power 5]

Experimental Results - Micro Benchmarks (L2 Cache Misses)
[Figure: two panels, Power 4 and Power 5]

Experimental Study - Olden & LLU (Speedup)
[Figure: two panels, Power 4 and Power 5; benchmarks in each panel: bh, em3d, health, power, tsp, llu]

Active Hardware Prefetch Streams
[Figure: Active Prefetching Streams from Memory to L2 (in POWER4)]

Related Work
• Pool Allocation
  - Lattner & Adve - PLDI 2005
      Data Structure Analysis
• Array Based Structure Splitting
  - Zhong et al. - PLDI 2004
      Reference affinity / affinity based splitting
      Memory Trace
• Safe Pointer Based Structure Splitting
  - Jeon, Shin and Han - CC 2007
      Similar to non-uniform splitting
      Affinity based splitting uses static analysis
        - Regular expression framework
        - Guarantee safety with regular expressions

Final Remarks
• Our Compiler-Research Guiding Principles
  - Programming productivity
      Enables programmers to be efficient
      Enables easy-to-write/easy-to-maintain programs
  - Execution Time Performance
      Recover runtime efficiency (time, storage or energy) through
        - Code analysis
        - Improved code generation
        - Knowledge of computer architecture and memory hierarchy

Pointer Analysis Primer
• The following statement:
    int *a = malloc(…);
• Creates: a memory object (A), a pointer (a), and a points-to relation (a,A).
[Figure: a → A]

Alias Analysis Primer: Andersen’s vs. Steensgaard’s (Shapiro/Horwitz, POPL 97)
Program:
    a = &b;
Andersen: S = {(a,b)}
Steensgaard (unification-based):
[Figure: both analyses draw the single points-to edge a → b]

Alias Analysis Primer: Andersen’s vs. Steensgaard’s (Shapiro/Horwitz, POPL 97)
Program:
    a = &b;
    b = &c;
Andersen: S = {(a,b); (b,c)}
Steensgaard (unification-based):
[Figure: both analyses draw the chain a → b → c]

Alias Analysis Primer: Andersen’s vs. Steensgaard’s (Shapiro/Horwitz, POPL 97)
Program:
    a = &b;
    b = &c;
    a = &d;
Andersen: S = {(a,b); (b,c); (a,d)}
Steensgaard (unification-based), before processing a = &d: S = {(a,b); (b,c)}
What should happen in the Steensgaard analysis?
[Figure: Andersen graph with a → b, a → d, b → c; Steensgaard graph so far with a → b → c]

Alias Analysis Primer: Andersen’s vs. Steensgaard’s (Shapiro/Horwitz, POPL 97)
Program:
    a = &b;
    b = &c;
    a = &d;
Andersen: S = {(a,b); (b,c); (a,d)}
Steensgaard (unification-based): S = {(a,b); (b,c); (a,d); (d,c)}
[Figure: Andersen graph with a → b, a → d, b → c; Steensgaard graph with b and d merged into one node, a → (b,d) → c]

Alias Analysis Primer: Andersen’s vs. Steensgaard’s (Shapiro/Horwitz, POPL 97)
Program:
    a = &b;
    b = &c;
    a = &d;
    d = &e;
Andersen, before processing d = &e: S = {(a,b); (b,c); (a,d)}
Steensgaard (unification-based), before processing d = &e: S = {(a,b); (b,c); (a,d); (d,c)}
And now?
[Figure: Andersen graph with a → b, a → d, b → c; Steensgaard graph with a → (b,d) → c]

Alias Analysis Primer: Andersen’s vs. Steensgaard’s (Shapiro/Horwitz, POPL 97)
Program:
    a = &b;
    b = &c;
    a = &d;
    d = &e;
Andersen: S = {(a,b); (b,c); (a,d); (d,e)}
Steensgaard (unification-based): S = {(a,b); (b,c); (a,d); (d,c); (d,e); (b,e)}
[Figure: Andersen graph with a → b, a → d, b → c, d → e; Steensgaard graph with merged nodes, a → (b,d) → (c,e)]