Skadu: Efficient Vector Shadow Memories for Poly-Scopic Program Analysis

Donghwan Jeon*, Saturnino Garcia+, and Michael Bedford Taylor
UC San Diego
* Now at Google, Inc.  + Now at University of San Diego

"Skadu" means "shadow" in Afrikaans.

Dynamic Program Analysis

Runtime analysis of a program's behavior. Powerful: gives perfect knowledge of the program's behavior for a specific input.
– Resolves all the memory addresses.
– Resolves all the control paths.
Challenges (offline DPA):
– Memory overhead (> 5X is a serious problem): 1GB -> 5GB.
– Runtime overhead (> 250X is a serious problem): 5 minutes -> 1 day.

Motifs for Dynamic Program Analysis

Shadow Memory: associates analysis metadata with a program's dynamic memory addresses.
Poly-Scopic Analysis: associates analysis metadata with a program's dynamic scopes.

Motifs for Dynamic Program Analysis

Shadow Memory: associates analysis metadata with a program's dynamic memory addresses.
Poly-Scopic Analysis: associates analysis metadata with a program's dynamic scopes.
Vector Shadow Memory (this paper): associates analysis metadata with BOTH dynamic memory addresses AND dynamic scopes.

Motif #1 – Shadow Memory: An Enabler for Dynamic Memory Analysis

Data structure that associates metadata (or a tag) with each memory address. Typically implemented with a multi-level table.

[Figure: a two-level shadow memory. A pointer table indexed by the high address bits selects a tag table indexed by the low bits. Example: counting the memory accesses of each address (0x4000, 0x4004, 0x4008, …) for the sample C code "int *addr = 0x4000; int value = *addr; *(addr+1) = value;".]
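A minimal sketch of such a two-level table in C (assumed 32-bit addresses split 16/16; all names are hypothetical, not the paper's code), counting the accesses to each address:

    #include <stdint.h>
    #include <stdlib.h>

    #define OFF_BITS 16
    #define SEG_SIZE (1u << 16)              /* first-level (pointer) table */
    #define OFF_SIZE (1u << OFF_BITS)        /* second-level (tag) tables   */

    typedef uint32_t Tag;                    /* per-address access count */
    static Tag *segment_table[SEG_SIZE];     /* lazily populated */

    /* Return the tag slot for an address, allocating its tag table on
       first touch so untouched regions cost nothing. */
    static Tag *shadow_lookup(uint32_t addr) {
        uint32_t seg = addr >> OFF_BITS;
        uint32_t off = addr & (OFF_SIZE - 1);
        if (segment_table[seg] == NULL)
            segment_table[seg] = calloc(OFF_SIZE, sizeof(Tag));
        return &segment_table[seg][off];
    }

    /* Instrumentation hook: called on every load and store. */
    void on_memory_access(uint32_t addr) { (*shadow_lookup(addr))++; }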

Dynamic Program Analyses Employing Shadow Memory

Critical Path Analysis [1988]
– Finds the longest dynamic dependence chain in a program.
– Employs shadow memory to propagate the earliest completion time of memory operations.
– Useful for approximating the quantity of parallelism available in a program.
TaintCheck [2005]
Valgrind [2007]
Dr. Memory [2011]
…

Motif #2 – Poly-Scopic Analysis

Analyzes multiple dynamic scopes (e.g. loops and functions on the callstack) by recursively running a dynamic analysis. Main benefit: provides scope-localized information. Commonly found in performance optimization tools that focus programmer attention on specific, localized program regions.

Poly-Scopic Analysis Example: Time Profiling (e.g. Intel VTune)

    for (i=0 to 4)
        for (j=0 to 32)
            foo();
        for (k=0 to 2)
            bar1();
            bar2();

Recursively measures each scope's total-time:
– total-time(scope) = time_end(scope) - time_begin(scope)
Pinpoints important scopes to optimize by using self-time:
– self-time(scope) = total-time(scope) - total-time(children)

    Scope      total-time    self-time
    Loop i     100%          0%
    Loop j     50%           0%
    foo()      50%           50%
    Loop k     50%           0%
    bar1()     25%           25%
    bar2()     25%           25%
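The self-time rule is just a per-node subtraction over the scope tree; for instance, self-time(Loop k) = 50% - (25% + 25%) = 0%. A minimal C sketch, assuming a hypothetical ScopeNode type:

    typedef struct ScopeNode {
        double total_time;               /* inclusive time of this scope */
        struct ScopeNode **children;     /* directly nested scopes */
        int nchildren;
    } ScopeNode;

    /* self-time = total-time minus the total-time of all direct children */
    double self_time(const ScopeNode *s) {
        double t = s->total_time;
        for (int i = 0; i < s->nchildren; i++)
            t -= s->children[i]->total_time;
        return t;
    }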

Hierarchical Critical Path Analysis (HCPA): Converting CPA to Poly-Scopic

Recursively measures total-parallelism by running CPA:
– "How much parallelism is in each scope?"
Pinpoints important scopes to parallelize with self-parallelism:
– "What is the merit of parallelizing this loop (e.g. outer, middle, inner)?"
HCPA is useful:
– Provides a list of scopes that deserve parallelization [PLDI 2011].
– Estimates the parallel speedup of a serial program [OOPSLA 2011] (beats 3rd-party experts).

Memory Overhead: Key Challenge of Using Shadow Memory in a Poly-Scopic Analysis

A conventional shadow memory already incurs high memory overhead (e.g. CPA). Poly-scopic analysis requires an independent shadow memory space for each dynamic scope, causing multiplicative memory expansion (e.g. HCPA).

[Figure: the runtime scope tree of the example loop nest (loop i over loop j and loop k; foo(), bar1(), bar2() as leaves), with a separate shadow memory attached to every scope.]

HCPA's Outrageous Memory Overhead

Before applying the techniques in this paper… [PLDI 2011]

[Table: native memory (GB), memory w/ HCPA (GB), and memory expansion factor for the Spec benchmarks bzip2, mcf, and gzip and the NPB benchmarks mg, cg, is, and ft; the per-benchmark values did not survive transcription. Geomean memory expansion factor: 59X.]

This Paper's Contribution

Make shadow-memory based poly-scopic analysis practical by reducing its memory overhead!
– BEFORE: a large NUMA machine w/ 512GB, in a supercomputer center.
– AFTER: a MacBook Air w/ 4GB, in a student's dorm room.

Outline

Introduction
Vector Shadow Memory
Lightweight Tag Validation
Efficient Storage Management
Experimental Results
Conclusion

What's Wrong with Conventional Shadow Memory Implementations?

Setting up and cleaning up shadow memory at scope boundaries incurs significant memory and runtime overhead. And for every memory access, every scope has to look up a tag by traversing its own multi-level table.

[Figure: each scope in the tree owns its own multi-level shadow memory (segment table plus tag tables), so a single program access triggers one table traversal per active scope.]

The Scope Model of Poly-Scopic Analysis

What is a scope?
– A scope is a single-entry program entity (e.g. a loop or function invocation).
– Two scopes are either nested or do not overlap.
Properties:
– Scopes form a hierarchical tree at runtime.
– Scopes at the same level never overlap.

[Figure: the example loop nest and its runtime scope tree: loop i at level 0; loop j and loop k at level 1; foo(), bar1(), bar2() at level 2.]
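Because scopes are single-entry and properly nested, the runtime only has to track the current nesting depth; a minimal sketch with hypothetical instrumentation hooks:

    static int current_level = -1;                /* depth of the dynamic scope stack */

    void scope_enter(void) { current_level++; }   /* called at loop/function entry */
    void scope_exit(void)  { current_level--; }   /* called at the matching exit   */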

Vector Shadow Memory

Associates a tag vector with each address:
– Scopes at the same level share the same tag storage.
– A scope's level is its index into the tag vector.
Benefits:
– No storage setup / clean-up overhead at scope boundaries.
– A single tag vector lookup gives access to all the tags.

    addr     size    Level 0    Level 1    Level 2
    0x4000   3       tag[0]     tag[1]     tag[2]
    0x4004   3       tag[0]     tag[1]     tag[2]
    0x4008   3       tag[0]     tag[1]     tag[2]
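A sketch of the per-address layout this implies (hypothetical names; the actual storage organizations appear later in the talk): one vector of tags per address, indexed by scope level, so one lookup serves every live scope on the callstack.

    #include <stdint.h>

    typedef uint64_t Tag;            /* e.g. a timestamp in HCPA */

    typedef struct TagVector {
        int  size;                   /* number of levels currently stored */
        Tag *tags;                   /* tags[level] for level 0..size-1   */
    } TagVector;

    /* All scopes at `level` share the same slot, so a single vector
       lookup gives access to the tags of every active scope. */
    Tag  read_tag (const TagVector *v, int level)  { return v->tags[level]; }
    void write_tag(TagVector *v, int level, Tag t) { v->tags[level] = t; }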

Challenge: Tag Validation

A tag is valid only within a scope, but scopes at the same level share the same tag storage, so each tag element must be checked for validity before use.

[Figure: counting memory accesses in each scope. The tag vector of 0x4000 is written in foo() and later read in bar2(); bar2() must not see foo()'s stale tags, so the read triggers tag validation.]

How can we support lightweight tag validation?

Challenge: Storage Management

The tag vector size is determined by the level of the scope that accesses the address, so the storage allocation must be adjusted as the tag vector size changes.

[Figure: the tag vector of 0x4000 over time. An access from level 2 allocates the vector; a later access from level 9 expands it; an access from level 1 lets it shrink.]

How can we efficiently manage storage without significant runtime overhead?

Outline

Introduction
Vector Shadow Memory
Lightweight Tag Validation
Efficient Storage Management
Experimental Results
Conclusion

Overview of Tag Validation Techniques

For tag validation, Skadu uses versions that identify the active scopes (the scopes on the callstack) when a tag vector is written:
– Baseline: store a version for each tag element.
– SlimTV: store a version for each tag vector.
– BulkTV: share a version across a group of tag vectors.

[Figure: (a) Baseline: each Tag[N] vector paired with a Ver[N] vector; (b) SlimTV: Tag[N] plus a single Ver; (c) BulkTV: many Tag[N] vectors sharing one Ver.]

Baseline Tag Validation

Store a version alongside every tag element; on access, invalidate any level whose version is stale:

    for (level = 0; level < N; level++) {
        if (Ver[level] != Ver_active[level]) {
            /* stale: invalidate this level */
            Tag[level] = INIT_VAL;
            Ver[level] = Ver_active[level];
        }
    }

Slim Tag Validation: Store Only a Single Version

Clever trick: create a total ordering between all versions in all dynamic scopes (a timestamp). Each tag vector then stores just one version. Tag validation:

    /* find the deepest valid level */
    j = find_max_valid_level(Ver, Ver_active, depth);

    /* scrub stale tags from level j+1 upward (init value 0) */
    memset(&Tag[j+1], 0, (N-j-1) * sizeof(Tag[0]));

    /* record the total-ordered version of the current scope */
    Ver = Ver_active[current_level];
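A sketch of the find step under the total-order assumption (hypothetical helper name): active-scope versions grow with nesting depth, so the valid levels form a prefix ending at the last level whose active version is no newer than the vector's stored version.

    #include <stdint.h>

    /* Ver_active[0..depth-1]: versions (entry timestamps) of the scopes
       currently on the callstack; they increase monotonically with depth.
       Returns the deepest level whose tag is still valid, or -1 if the
       entire vector is stale. */
    int find_max_valid_level(uint64_t ver, const uint64_t *Ver_active, int depth) {
        int j = -1;
        for (int level = 0; level < depth; level++) {
            if (Ver_active[level] <= ver)
                j = level;       /* written after this scope was entered: valid */
            else
                break;           /* versions only grow with depth */
        }
        return j;
    }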

Bulk Tag Validation: Share a Version Across Multiple Tag Vectors

Clever trick: exploit memory locality and validate multiple tag vectors together.
Benefit: reduced memory footprint and more efficient per-tag-vector invalidation.
Downside: some tag vectors might never be accessed after tag validation, wasting the validation effort.
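A sketch of the grouping, reusing find_max_valid_level from the SlimTV sketch and assuming a hypothetical page of 64 vectors: one version comparison validates the whole group, at the cost of scrubbing vectors that may never be read again.

    #include <stdint.h>
    #include <string.h>

    #define VECTORS_PER_PAGE 64
    #define N 64                             /* max scope depth per vector */
    typedef uint64_t Tag;

    typedef struct TagPage {
        uint64_t ver;                        /* shared by all vectors in the page */
        Tag tags[VECTORS_PER_PAGE][N];
    } TagPage;

    int find_max_valid_level(uint64_t ver, const uint64_t *Ver_active, int depth);

    void bulk_validate(TagPage *p, const uint64_t *Ver_active, int depth) {
        int j = find_max_valid_level(p->ver, Ver_active, depth);
        for (int v = 0; v < VECTORS_PER_PAGE; v++)  /* scrub the stale suffix of every vector */
            memset(&p->tags[v][j+1], 0, (N - j - 1) * sizeof(Tag));
        p->ver = Ver_active[depth - 1];      /* stamp with the innermost active version */
    }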

Outline

Introduction
Vector Shadow Memory
Lightweight Tag Validation
Efficient Storage Management
Experimental Results
Conclusion

Baseline: Array-Based VSM Organization

A tag vector is stored contiguously in an array.
Efficient tag vector operations:
– Loop through each array element.
Expensive resizing:
– Resizing requires expensive array reallocation.
– Unclear when to shrink a tag vector to reclaim memory.

[Figure: array-based VSM. Each address (0x4000, 0x4004, 0x4008, …) maps to a contiguous row of tags, one per level.]

Alternative: Level-Based VSM Organization

Idea: reorganize tag storage by scope level so that like-invalidated tags are contiguous.
Efficient tag vector resizing:
– Resizing is part of tag validation.
– Simply update a pointer in the level table.
– Dirty tag tables are added to a free list and asynchronously scrubbed.
Inefficient tag vector operations:
– Tag vectors are no longer stored contiguously.

[Figure: level-based VSM. A level table points to one tag table per level; resizing just NULLs a level-table pointer, and the detached tag table is scrubbed lazily.]

Use the Best of Both: Dual Representation

Implement a Tag Vector Cache using arrays: fast execution for recently accessed tags. Implement a Tag Vector Store using levels: efficient storage for tags not recently accessed. Tag vectors are filled into the cache on access and evicted back to the store.

[Figure: the array-based Tag Vector Cache sits in front of the level-based Tag Vector Store, with Fill and Evict operations moving tag vectors between them.]
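A sketch of the cache/store interaction under simplifying assumptions (word-granularity addresses inside one fixed region; all names hypothetical): a direct-mapped array cache holds contiguous tag vectors for hot addresses, and a miss scatters the evicted vector back into the per-level tables while gathering the new one.

    #include <stdint.h>
    #include <stdlib.h>

    #define NUM_LEVELS  64
    #define CACHE_LINES (1u << 16)           /* direct-mapped vector cache */
    #define STORE_SLOTS (1u << 20)           /* one slot per word in the region */

    typedef uint64_t Tag;

    typedef struct {
        int      valid;
        uint64_t addr;                       /* address this entry caches */
        Tag      tags[NUM_LEVELS];           /* contiguous: fast vector ops */
    } CacheEntry;

    static CacheEntry cache[CACHE_LINES];    /* Tag Vector Cache (array-based) */
    static Tag *level_table[NUM_LEVELS];     /* Tag Vector Store (level-based) */

    static Tag *store_slot(int level, uint64_t addr) {
        if (level_table[level] == NULL)      /* lazily allocate one table per level */
            level_table[level] = calloc(STORE_SLOTS, sizeof(Tag));
        return &level_table[level][(addr >> 2) & (STORE_SLOTS - 1)];
    }

    /* Return the contiguous tag vector for addr, filling/evicting on a miss. */
    Tag *cache_access(uint64_t addr) {
        CacheEntry *e = &cache[(addr >> 2) & (CACHE_LINES - 1)];
        if (!e->valid || e->addr != addr) {
            for (int l = 0; l < NUM_LEVELS; l++) {
                if (e->valid)
                    *store_slot(l, e->addr) = e->tags[l];  /* evict: scatter */
                e->tags[l] = *store_slot(l, addr);         /* fill: gather   */
            }
            e->valid = 1;
            e->addr  = addr;
        }
        return e->tags;
    }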

Triple Representation: Add a Compressed Store for Very Infrequently Used Tag Vectors

[Figure: the level-based Tag Vector Store fills from, and evicts to, a Compressed Tag Vector Store holding compressed vector tags.]

Outline

Introduction
Vector Shadow Memory
Lightweight Tag Validation
Efficient Storage Management
Experimental Results
Conclusion

Experimental Setup

Measure the peak memory usage:
– Compare to our baseline implementation [PLDI 2011].
– Target Spec and NAS Parallel Benchmarks.
HCPA:
– Tag: 64-bit timestamp; Version: 64-bit integer.
– Tag Vector Cache: covers 4MB of address space.
– Tag Vector Store: 4KB units.
Memory Footprint Profiler:
– Please see the paper for details.

HCPA Memory Expansion Factor Reduction

[Chart: memory expansion factor relative to native execution's memory usage, reduced in stages: SlimTV, then the Dual Representation (includes BulkTV), then the Triple Representation (+ Compression), ending at the final memory expansion factors.]

Conclusion

Conventional shadow memory does not work with poly-scopic analysis due to excessive memory overhead. Skadu is an efficient vector shadow memory implementation designed for poly-scopic analysis:
– Shares storage across all the scopes at the same level.
– SlimTV and BulkTV reduce tag validation overhead.
– A novel Triple Representation balances performance and storage (Tag Vector Cache, Tag Vector Store, Compressed Store).
Impressive results:
– HCPA: 11.4X memory reduction over the baseline implementation with only 1.2X runtime overhead.


HCPA Speedup (backup)

[Chart: results for the SlimTV, VCache + VStorage, and + Dynamic Compression configurations, alongside the final memory expansion factors.]