Recursive Data Structure Profiling. Easwaran Raman and David I. August, Princeton University.

Presentation transcript:

1 Recursive Data Structure Profiling
Easwaran Raman, David I. August
Princeton University

2 Motivation
- Huge processor-memory performance gap
  - Latency > 100 cycles
- Significant fraction of memory operations in typical programs
- In many applications, Recursive Data Structures (RDS) constitute a large fraction of memory usage
[Figure: chart with a "Year" axis]

3 Motivation
- Techniques to minimize the performance impact of this gap
  - Caching, prefetching, out-of-order (OoO) execution
- Not very successful for RDS
  - Many RDS properties are difficult to determine statically
  - Accesses are irregular and usually lie on the critical path of execution

Traversal code:

    while (valid(node)) {
        // do something with node->data
        node = next(node);
    }

- Short loop body prevents efficient OoO execution
- Non-contiguous layout results in irregular access patterns
[Figure: an RDS layout example with nodes at addresses 0x1000, 0x2000, 0x3000, 0x4000]

4 Motivation
- Linearization [Clark76, Luk99]
- Speculation recovery cost outweighs the benefit if the next pointer field gets overwritten frequently
- Information on the dynamic behavior of the entire RDS is important

    index = 0;
    head = pos[index];
    while (head) {
        foo(head);
        head = pos[++index];
        check(head);
    }

[Figure: list nodes reached through head and pos. Placement of the nodes in the figure corresponds to their placement in memory.]
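The check step in the loop above can be modeled in plain C. This is a minimal sketch, not the scheme from [Clark76, Luk99]: the `node` type, the `traverse_linearized` name, and the policy of falling back to pointer chasing after the first mismatch are all assumptions made for illustration. It only counts how often the position array disagrees with the real `next` pointer, which is the event that would trigger speculation recovery.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; only the next field matters for the check. */
typedef struct node {
    int data;
    struct node *next;
} node;

/* Walk a list through a position array pos[0..n-1] that is assumed to hold
 * the nodes in list order. After each step the successor predicted by pos is
 * compared with the real next pointer. On the first mismatch the walk stops
 * trusting pos and chases pointers instead.
 * Returns the number of mispredictions; *sum receives the sum of data. */
size_t traverse_linearized(node *head, node **pos, size_t n, long *sum)
{
    size_t mispredictions = 0;
    size_t index = 0;
    node *cur = head;
    int speculating = (n > 0 && pos[0] == head);

    assert(sum != NULL);
    *sum = 0;
    while (cur != NULL) {
        node *actual = cur->next;
        *sum += cur->data;
        if (speculating) {
            node *guess = (index + 1 < n) ? pos[index + 1] : NULL;
            index++;
            if (guess != actual) {      /* plays the role of check(head) */
                mispredictions++;
                speculating = 0;        /* recover: follow real pointers */
            }
        }
        cur = actual;
    }
    return mispredictions;
}
```

A list whose `next` fields are rewritten often would report a misprediction on most traversals, which is the case where recovery cost dominates.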

5 RDS Profile
- RDS profiling gives a ‘logical’ understanding of runtime behavior
  - ‘Application creates 100 trees’ instead of ‘application allocates 2MB in heap’
  - ‘Linked list traversed 10 times’ instead of ‘Address 0x accessed 200 times’
- Profile for linearization: next pointer field in list L is modified n times

6 RDS Discovery
- Assign a unique id to each value returned by malloc and create a node labeled with that id
- Connect nodes by a directed edge if both the address and the value of a store have valid ids

C function for creating a tree:

    node *tree_create() {
        node *n = (node *)malloc(…);
        …
        n->left = tree_create(…);
        n->right = tree_create(…);
        return n;
    }

Execution trace in (pseudo) assembly:

    call malloc                   ; id = 1
    mov r10 = r8
    …
    call tree_create
    …
    call malloc                   ; id = 2
    …
    mov r11 = r8
    store r10[offset1] = r11      ; create 1->2
    call tree_create
    …
    call malloc                   ; id = 3
    …
    mov r12 = r8
    store r10[offset2] = r12      ; create 1->3

[Figure: Dynamic Shape Graph (DSG) with nodes 1, 2, 3]
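The two discovery rules can be sketched as a pair of callbacks that an instrumentation tool would invoke. Everything named here (`dsg`, `dsg_on_malloc`, `dsg_on_store`, the fixed `MAX_NODES` cap, the linear scan over allocations) is a hypothetical simplification, and frees are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 64   /* arbitrary cap for this sketch */

/* One heap object seen by the profiler. */
typedef struct {
    uintptr_t base;
    size_t size;
} alloc_rec;

/* Dynamic shape graph: node i + 1 is the object recorded in nodes[i]. */
typedef struct {
    alloc_rec nodes[MAX_NODES];
    unsigned char edge[MAX_NODES][MAX_NODES];
    int count;
} dsg;

/* Called for each value returned by malloc; returns the new dynamic id. */
int dsg_on_malloc(dsg *g, const void *p, size_t size)
{
    assert(g->count < MAX_NODES);
    g->nodes[g->count].base = (uintptr_t)p;
    g->nodes[g->count].size = size;
    return ++g->count;
}

/* Returns the id of the object containing p, or 0 if there is none. */
int dsg_id_of(const dsg *g, const void *p)
{
    uintptr_t a = (uintptr_t)p;
    for (int i = 0; i < g->count; i++)
        if (a >= g->nodes[i].base && a - g->nodes[i].base < g->nodes[i].size)
            return i + 1;
    return 0;
}

/* Called for each store "*addr = value". Adds a directed edge only when
 * both the address and the stored value map to valid ids.
 * Returns 1 if an edge was recorded, 0 otherwise. */
int dsg_on_store(dsg *g, const void *addr, const void *value)
{
    int from = dsg_id_of(g, addr);
    int to = dsg_id_of(g, value);
    if (from == 0 || to == 0)
        return 0;
    g->edge[from - 1][to - 1] = 1;
    return 1;
}
```

The graph must start zeroed (for example `dsg g = {0};`). Replaying the trace above gives ids 1, 2, 3 and edges 1->2 and 1->3.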

7 RDS Discovery
- Multiple RDS instances can be connected together in the DSG!
- To separate them, we use properties of the static code
  - Use another graph called the Static Shape Graph (SSG)

    array = malloc(…);
    for (i=…)
        array[i] = create_tree(…);
    …

8 RDS Discovery
- For every static call to malloc, create a node with a unique id in the Static Shape Graph (SSG)
- If a store creates an edge, connect the corresponding static nodes
- Check for strongly connected components (SCCs) in the SSG
- Connect two dynamic nodes only if their corresponding static nodes are in the same SCC

Execution trace in (pseudo) assembly:

    call malloc                       ; id = 1
    mov r20 = r8
    … call malloc                     ; id = 2
    … mov r10 = r8
    ……
    … call tree_create
    …… call malloc                    ; id = 3
    …… mov r11 = r8
    … store r10[offset1] = r11        ; create 2->3
    … call tree_create
    …… call malloc                    ; id = 4
    …… mov r12 = r8
    … store r10[offset2] = r12        ; create 2->4
    store r20[0] = r10                ; create 1->2

[Figure: the DSG and the SSG, with SSG nodes labeled A and T]
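The SCC test on the static graph can be sketched as follows. For brevity this sketch swaps in a Warshall transitive closure rather than a linear-time SCC algorithm such as Tarjan's: two malloc sites are in the same SCC exactly when each reaches the other. The `ssg` type, the function names, and the `MAX_SITES` cap are assumptions for illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_SITES 32   /* arbitrary cap on static malloc call sites */

/* Static shape graph stored as a reachability matrix over call sites. */
typedef struct {
    unsigned char reach[MAX_SITES][MAX_SITES];
    int count;
} ssg;

void ssg_init(ssg *g, int sites)
{
    assert(sites >= 0 && sites <= MAX_SITES);
    memset(g, 0, sizeof *g);
    g->count = sites;
    for (int i = 0; i < sites; i++)
        g->reach[i][i] = 1;            /* every site reaches itself */
}

/* Record that a store linked an object from site `from` to one from `to`. */
void ssg_add_edge(ssg *g, int from, int to)
{
    assert(from >= 0 && from < g->count && to >= 0 && to < g->count);
    g->reach[from][to] = 1;
}

/* Warshall transitive closure; call after adding edges. */
void ssg_close(ssg *g)
{
    for (int k = 0; k < g->count; k++)
        for (int i = 0; i < g->count; i++)
            for (int j = 0; j < g->count; j++)
                if (g->reach[i][k] && g->reach[k][j])
                    g->reach[i][j] = 1;
}

/* Nonzero when sites a and b are in the same SCC (valid after ssg_close). */
int ssg_same_scc(const ssg *g, int a, int b)
{
    return g->reach[a][b] && g->reach[b][a];
}
```

In the slide's example, with site 0 standing for A (the array) and site 1 for T (tree nodes), the stores add edges T->T and A->T. A and T then fall in different SCCs, so the dynamic edge 1->2 is dropped while 2->3 and 2->4 are kept, which separates the trees from the array and from each other.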

9 Experimental setup
- Uses Pin, a dynamic instrumentation tool for Itanium
- The mapping between address ranges and dynamic ids is stored in an AVL tree
  - Most recent mapping is cached
- A mix of benchmarks from SPEC, Olden and other pointer-intensive applications
  - Dynamic instruction count varies from a few million (ks) to over 300 billion (mesa)
- All experiments run on a 900MHz Itanium 2 with 2 GB RAM running RH 7.1
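The address-to-id lookup with a one-entry cache can be sketched as below. To stay short, this sketch replaces the AVL tree with a sorted array and binary search, so insertion is linear rather than logarithmic. The `range_map` type and function names are hypothetical, ranges are assumed disjoint, and "most recent mapping" is interpreted here as the last range inserted or hit, which is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGES 1024   /* arbitrary cap for this sketch */

typedef struct {
    uintptr_t base;
    size_t size;
    int id;
} range;

typedef struct {
    range r[MAX_RANGES];  /* sorted by base; ranges assumed disjoint */
    int count;
    int last;             /* index of the most recent mapping, -1 if none */
} range_map;

static int range_contains(const range *x, uintptr_t a)
{
    return a >= x->base && a - x->base < x->size;
}

void map_init(range_map *m)
{
    m->count = 0;
    m->last = -1;
}

/* Insert [base, base + size) with the given id, keeping r sorted. */
void map_insert(range_map *m, uintptr_t base, size_t size, int id)
{
    int i = m->count;
    assert(m->count < MAX_RANGES);
    while (i > 0 && m->r[i - 1].base > base) {
        m->r[i] = m->r[i - 1];
        i--;
    }
    m->r[i] = (range){base, size, id};
    m->count++;
    m->last = i;
}

/* Return the id of the range containing a, or 0 if there is none. */
int map_lookup(range_map *m, uintptr_t a)
{
    int lo = 0, hi = m->count - 1;
    if (m->last >= 0 && range_contains(&m->r[m->last], a))
        return m->r[m->last].id;          /* cache hit */
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (range_contains(&m->r[mid], a)) {
            m->last = mid;
            return m->r[mid].id;
        }
        if (a < m->r[mid].base)
            hi = mid - 1;
        else
            lo = mid + 1;
    }
    return 0;
}
```

The cache pays off because consecutive accesses tend to touch the same object, so many lookups never reach the search.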

10 Profiler Performance
- Profile: RDS size, lifetime, access count
- Memory: <16 MB for all but 3 applications
- Baseline: execution using Pin (~10 times slower than native)

11 RDS usage statistics
- SCCs in static shape graph (RDS types)
  - Usually a few (<5) per benchmark, a maximum of 31 in parser
- #RDS instances (connected components in DSG)
  - Exhibits a wide range (1 in mcf to around a million in parser)
  - Tend to be live for long if the program creates only a few of them
- Sizes of RDS instances
  - Vary from a single-node self-loop (parser) to a few hundred thousand nodes (mcf, parser)
- #pointer chasing loads
  - Significant in many benchmarks
- Applications show vast diversity in RDS usage
  - A good reason for profiling them!

12 Temporal distribution

13 Cumulative distribution of RDS lifetimes

14 RDS Stability
- Stability of an RDS: a notion of how 'array-like' an RDS is
- Stability index: an attempt to quantify this notion
  - Identify the time instances (alteration points) when changes occur to the RDS structure (by stores that replace existing pointers)
  - Count the traversals between successive alteration points
  - Stability index = #intervals that account for ‘most’ of the traversals
  - Lower index means higher stability
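One way to turn the stability index into a number is sketched below. The slides do not define 'most', so the `percent` threshold is an assumption, as is the reading that the index is the smallest number of intervals whose traversal counts reach that share of all traversals. The function name and signature are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Order unsigned long values from largest to smallest. */
static int cmp_desc(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x < y) - (x > y);
}

/* traversals[i] is the number of traversals observed in the i-th interval
 * between successive alteration points. Returns the smallest number of
 * intervals that together hold at least `percent` percent of all
 * traversals. The threshold is an assumption of this sketch. */
size_t stability_index(const unsigned long *traversals, size_t n,
                       unsigned percent)
{
    unsigned long long total = 0, covered = 0;
    unsigned long *sorted;
    size_t k = 0;

    assert(percent <= 100);
    if (n == 0)
        return 0;
    sorted = malloc(n * sizeof *sorted);
    assert(sorted != NULL);
    memcpy(sorted, traversals, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_desc);

    for (size_t i = 0; i < n; i++)
        total += sorted[i];
    /* Take the busiest intervals first until the threshold is reached. */
    while (k < n && covered * 100 < total * percent)
        covered += sorted[k++];

    free(sorted);
    return k;
}
```

With a 90 percent threshold, interval counts of 1, 97, 1, 1 give an index of 1 (nearly all traversals see one unchanged shape), while 25, 25, 25, 25 give an index of 4, matching the rule that a lower index means higher stability.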

15 Cumulative distribution of stability index

16 Conclusion
- Aggressive data-structure-level optimization techniques for RDS need profile information for improved performance
- RDS profiling gives a better understanding of the runtime behavior of RDS
- RDS usage varies widely across benchmarks

17 Extra Slides

18 RDS Profiling: Definitions
- RDS type: the abstract form of the logical data structure that is manipulated by the program
  - Examples: list, binary tree, graph, etc.
  - Can be mutually recursive (nodes point to their incident edges and vice versa to form a graph)
- RDS instance: a concrete realization of the RDS type
  - Example: the tree created in function foo, the list pointed to by the first entry of the hash table.
