Data Access Profiling & Improved Structure Field Regrouping in Pegasus Vas Chellappa & Matt Moore May 2, 2005 / Optimizing Compilers / Project Poster Session.


Introduction
- Structure definitions group fields by semantics, not by access contemporaneity
- Data access profiling can be used to improve cache performance by reordering fields for contemporaneity
- In this context, contemporaneity is a measure of how close in time two accesses to structure fields occur

Problem Statement
- Obtaining contemporaneity information for structure fields
- Exploiting this information to improve the ordering of the fields
- Doing this within the CASH/Pegasus environment

Approach
- Pegasus Implementation
  - Data access profiling tracks contemporaneous field accesses to build the Field Affinity Graphs
  - The Simulator interface to SimpleScalar (third-party cache simulator) is modified to achieve this
- Regrouping Algorithm
  - Field Affinity Graphs built by the modified Simulator are then used to recommend reorderings based on a new regrouping algorithm

Project Design

Design Overview
1. Build stage: Tag structure field accesses in the Pegasus IR
2. Simulation stage: Propagate tag information through SimpleScalar to the new regroup library
3. Final stage: Invoke regrouping algorithm to calculate reordering recommendations

Build Stage, Tagging Accesses
- Objective: Identify and tag structure field accesses in the Pegasus IR
- Not trivial, since SUIF/C2DIL do not preserve the required type information during transformation to IR
- Need to identify patterns that indicate structure field accesses

Field Accesses in Pegasus

Actual Pegasus Illustration
    int foo(struct my_t stestfoo) {
        int retval = stestfoo.f2;
        return(retval);
    }
Which wire here should have struct type?
    int foo(struct my_t* stestfoo) {
        return(stestfoo->f2);
    }
Which wire here has struct type?

Simulation Process
- Tag info on loads and stores is propagated through SimpleScalar to the regrouping library, which builds the field affinity graph (done online, during simulation)
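The online construction described above can be sketched in C as follows. This is only an illustration under stated assumptions: the slides do not say how contemporaneity is measured, so the sliding window of recent accesses, the names `affinity_graph`, `affinity_init`, and `affinity_record`, and the constants `MAX_FIELDS` and `WINDOW` are all hypothetical. Each tagged access adds one to the edge weight for every different field currently in the window.

```c
#include <stddef.h>
#include <string.h>

#define MAX_FIELDS 16 /* hypothetical cap on fields per structure */
#define WINDOW 4      /* hypothetical: recent accesses treated as contemporaneous */

typedef struct {
    unsigned long weight[MAX_FIELDS][MAX_FIELDS]; /* symmetric edge weights */
    int recent[WINDOW];                           /* ring buffer of recent field ids */
    size_t count;                                 /* accesses recorded so far */
} affinity_graph;

/* Reset all edge weights and the access history. */
void affinity_init(affinity_graph *g) {
    memset(g, 0, sizeof *g);
}

/* Record one tagged load or store of field `field` (0 <= field < MAX_FIELDS).
 * Every different field still in the window gains one unit of affinity. */
void affinity_record(affinity_graph *g, int field) {
    if (field < 0 || field >= MAX_FIELDS)
        return; /* ignore untagged or out-of-range accesses */
    size_t n = g->count < WINDOW ? g->count : WINDOW;
    for (size_t i = 0; i < n; i++) {
        int other = g->recent[i];
        if (other != field) {
            g->weight[field][other]++;
            g->weight[other][field]++;
        }
    }
    g->recent[g->count % WINDOW] = field; /* overwrite the oldest entry */
    g->count++;
}
```

In this sketch a simulator hook would call `affinity_record` once per tagged load or store, with one `affinity_graph` per structure type; a time-based window (in cycles) would work the same way.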

Regrouping Stage
- After simulation, analyze the collected profiling data to produce a reordering recommendation
- The greedy approach used in previous work can be improved upon
- An optimal reordering cannot be computed efficiently in general (the problem is NP-hard)
- Field Affinity Graph (one per structure):
  - Vertices: fields in a structure
  - Edge weights: degree of contemporaneity of accesses between the fields

Matching Heuristic
- Find a maximum weight matching in the field affinity graph
- Fields that will not fit into a cache line together anyway are identified and ignored
- Structure is reordered by placing matched fields together
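For structures with only a handful of fields, a maximum weight matching can be found by exhaustive dynamic programming over subsets of fields. The sketch below takes that route; it is a stand-in for illustration, not the weighted matching solver listed in the references, and the names `max_weight_matching` and `MAX_FIELDS` are hypothetical. It assumes a symmetric weight matrix in which pairs that cannot share a cache line have already been given weight 0.

```c
#define MAX_FIELDS 16 /* hypothetical cap; the tables hold 2^MAX_FIELDS entries */

/* Returns the maximum total weight of a matching over fields 0..n-1 and
 * writes each field's partner (or -1 if unmatched) into mate[]. */
unsigned long max_weight_matching(int n,
                                  unsigned long w[MAX_FIELDS][MAX_FIELDS],
                                  int mate[MAX_FIELDS]) {
    static unsigned long best[1u << MAX_FIELDS]; /* best weight per field subset */
    static signed char choice[1u << MAX_FIELDS]; /* partner of lowest field, -1 = none */

    if (n < 0 || n > MAX_FIELDS)
        return 0;
    unsigned full = (1u << n) - 1u;

    best[0] = 0;
    for (unsigned mask = 1; mask <= full; mask++) {
        int i = 0;
        while (!(mask & (1u << i)))
            i++; /* lowest field in this subset */
        unsigned rest = mask & ~(1u << i);

        /* Option 1: leave field i unmatched. */
        best[mask] = best[rest];
        choice[mask] = -1;

        /* Option 2: pair field i with some other field j in the subset. */
        for (int j = i + 1; j < n; j++) {
            if (!(rest & (1u << j)))
                continue;
            unsigned long cand = w[i][j] + best[rest & ~(1u << j)];
            if (cand > best[mask]) {
                best[mask] = cand;
                choice[mask] = (signed char)j;
            }
        }
    }

    /* Walk the recorded choices to recover the pairs. */
    for (int i = 0; i < n; i++)
        mate[i] = -1;
    unsigned mask = full;
    while (mask) {
        int i = 0;
        while (!(mask & (1u << i)))
            i++;
        int j = choice[mask];
        mask &= ~(1u << i);
        if (j >= 0) {
            mate[i] = j;
            mate[j] = i;
            mask &= ~(1u << j);
        }
    }
    return best[full];
}
```

The running time grows as n * 2^n, so this only suits small field counts; a polynomial-time weighted matching algorithm would be the choice for larger structures. Matched fields in `mate[]` would then be placed next to each other in the recommended layout.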

Greedy vs. Matching
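To make the comparison concrete, here is a hedged C sketch of a simple greedy pairing that repeatedly takes the heaviest remaining edge. It is a simplification for illustration only: the slide's figure is not available, and this is not necessarily the exact greedy algorithm from previous work. The names `greedy_pairing` and `MAX_FIELDS` are hypothetical. On a four-field chain with affinities 9, 10, 9 between neighbouring fields, the greedy choice takes the middle edge (total 10), whereas pairing the two outer edges gives 18.

```c
#define MAX_FIELDS 16 /* hypothetical cap on fields per structure */

/* Greedy pairing: repeatedly match the two unmatched fields joined by the
 * heaviest edge. Returns the total weight; mate[] gets partners or -1. */
unsigned long greedy_pairing(int n,
                             unsigned long w[MAX_FIELDS][MAX_FIELDS],
                             int mate[MAX_FIELDS]) {
    unsigned long total = 0;

    if (n < 0 || n > MAX_FIELDS)
        return 0;
    for (int i = 0; i < n; i++)
        mate[i] = -1;

    for (;;) {
        int bi = -1, bj = -1;
        unsigned long bw = 0;

        /* Find the heaviest edge between two still-unmatched fields. */
        for (int i = 0; i < n; i++) {
            if (mate[i] >= 0)
                continue;
            for (int j = i + 1; j < n; j++) {
                if (mate[j] < 0 && w[i][j] > bw) {
                    bw = w[i][j];
                    bi = i;
                    bj = j;
                }
            }
        }
        if (bi < 0)
            break; /* no positive-weight edge left */

        mate[bi] = bj;
        mate[bj] = bi;
        total += bw;
    }
    return total;
}
```

Once the middle pair is taken, the two outer fields have no affinity with each other, so the greedy result cannot recover; a maximum weight matching considers all pairings together and avoids this.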

NP-Hardness
- NP-hardness is shown by reducing the graph coloring problem to the regrouping problem

Results
- Implemented successfully for structure field accesses made through pointers (ptr->fld)
- So far, only small programs have been tested
- Reordering is done manually, and the reordered program is fed into the simulator again to obtain the number of cycles for comparison

Results - Example
Original (750 cycles per call):
    struct my_t {
        int f1;
        int f2;
        char nu[4096];
        int f3;
        int f4;
    };
    int foo(struct my_t *elt) {
        int i;
        elt->f1 = 2;
        elt->f4 = 100;
        for (i = 0; i < 50; i++) {
            elt->f1++;
            elt->f4--;
        }
        return elt->f1 + elt->f4;
    }
Modified (745 cycles per call, one fewer cache miss):
    struct my_t {
        int f1;
        int f4;
        int f2;
        char nu[4096];
        int f3;
    };
    int foo(struct my_t *elt) {
        int i;
        elt->f1 = 2;
        elt->f4 = 100;
        for (i = 0; i < 50; i++) {
            elt->f1++;
            elt->f4--;
        }
        return elt->f1 + elt->f4;
    }
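The saved cache miss can be checked from the layouts alone. The sketch below uses the standard `offsetof` macro to test whether `f1` and `f4` land in the same cache line. The 64-byte `LINE_SIZE`, the helper `same_cache_line`, and the renamed struct tags are assumptions for illustration; the slides do not give the simulated cache configuration, and the check assumes the structure starts on a line boundary.

```c
#include <stddef.h>

/* The two layouts from the example, renamed so both can coexist. */
struct my_t_original { int f1; int f2; char nu[4096]; int f3; int f4; };
struct my_t_modified { int f1; int f4; int f2; char nu[4096]; int f3; };

#define LINE_SIZE 64 /* hypothetical cache line size in bytes */

/* Nonzero if two byte offsets fall in the same cache line, assuming the
 * structure itself begins on a line boundary. */
int same_cache_line(size_t a, size_t b) {
    return a / LINE_SIZE == b / LINE_SIZE;
}
```

In the original layout the 4096-byte array separates `f1` from `f4`, so `same_cache_line(offsetof(struct my_t_original, f1), offsetof(struct my_t_original, f4))` is 0; in the modified layout the two fields are adjacent and the same call on `struct my_t_modified` is nonzero.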

Conclusion
- Performance improvements are achievable even on simple programs using the reordering recommendations
- Propagating full type information from the source through SUIF/c2dil would be required to optimize non-pointer accesses
- Languages that expose less of the memory layout to the programmer would allow easy and quick implementation of the reordering recommendations

References
- Trishul M. Chilimbi, Bob Davidson, and James R. Larus, "Cache-Conscious Structure Definition," in Proceedings of the ACM SIGPLAN '99 Conference on Programming Language Design and Implementation, pages 13-24, May
- Mathprog (Weighted Matching Algorithm)
- Pegasus:
- SUIF:
- SimpleScalar Tool set: