Jolt: Reducing Object Churn
AJ Shankar, Matt Arnold, Ras Bodik
UC Berkeley, IBM Research | OOPSLA 2008

Presentation transcript:


What is Object Churn? [Mitchell, Sevitsky 06]

Allocation of intermediate objects with short lifespans:

    int foo() { return bar().length(); }
    String bar() { return new String("foobar"); }

Each call to foo() allocates an intermediate String that dies as soon as foo() returns.
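A runnable sketch of the churn pattern on the slide (the `main` driver and loop count are added here for illustration):

```java
public class ChurnExample {
    // Every call to foo() allocates a temporary String that becomes garbage
    // immediately: this is object churn.
    static int foo() { return bar().length(); }

    static String bar() { return new String("foobar"); }

    public static void main(String[] args) {
        int total = 0;
        for (int i = 0; i < 1000; i++) {
            total += foo(); // 1000 short-lived String objects
        }
        System.out.println(total); // 6000
    }
}
```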

Churn is a Problem

- A natural result of abstraction
- Common in large component-based applications
- Reduces program performance
  - Puts pressure on the GC
  - Inhibits parallelization (temporary objects are synchronized)
  - Wastes CPU cycles
- Hard to eliminate
  - Escape analysis? Objects escape the allocating function
  - Refactoring? Requires cross-component changes

Jolt: Our Contribution

- Automatic runtime churn reduction
  - Lightweight dynamic analyses, simple optimization
- Implemented in IBM's J9 JVM
- Ran on large component-based benchmarks
  - Removes 4x as many allocations as escape analysis alone
  - Speedups of up to 15%

Objects Escape Allocation Context

- Traditional escape analysis: hands tied
- Several escape analyses explore up the stack to add context [Blanchet 99, Whaley 99, Gay 00]
  - Many contexts to explore and create to optimize an allocation site
  - Hard to scale to 10k functions, 500M context-sensitive allocations!

Houston, We Have a Solution

Jolt uses a two-part solution:
1. Dynamic analyses find churn-laden subprograms
   - Rooted at a function
   - Only as many contexts as functions in the program
   - Subprograms can contain many churned objects
2. Selectively inline portions of the subprogram into its root to create context
   - Churned objects no longer escape the context
   - Escape analysis can now run

Step 1: Find Roots: Churn Analysis

- Goal: identify roots of churn-laden subprograms
- Operate on the static call graph (the JIT's domain)
- Use dynamic heap information to track churn
- Use three dynamic analyses inspired by [Dufour 07]:
  - Capture
  - %Capture
  - %Control

Overview of Analyses

- Three analyses together pinpoint the subprogram root: Capture, %Capture, and %Control

Capture

Capture(f) = # of objects allocated by f or its descendants that die before f returns

In the example: Capture(f) = 4

Answers: is there enough churn in the subprogram rooted at f to be worth optimizing? High Capture → YES

%Capture

%Capture(f) = % of objects allocated by f or its descendants that die before f returns

In the example: %Capture(f) = 4/6

Answers: is it better to root at f than at the parent of f? High %Capture → YES

%Control

%Control(f) = % of allocated objects that are captured by f but not captured by its descendants

In the example: %Control(f) = 3/6

Answers: is it better to root at f than at a child of f? High %Control → YES

All Together Now

- Three analyses together pinpoint the subprogram root
  - High Capture: worth optimizing
  - High %Capture: better to root at f than at its parent
  - High %Control: better to root at f than at a child
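The two ratio metrics can be sketched directly from the definitions above. This is a minimal illustration, not Jolt's implementation; the method names and the per-function counts are hypothetical, using the running example's numbers (Capture(f) = 4, Alloc(f) = 6, children capturing 1 object in total):

```java
public class ChurnMetrics {
    // %Capture(f) = |Capture(f)| / |Alloc(f)|
    static double percentCapture(long captured, long allocated) {
        return allocated == 0 ? 0.0 : (double) captured / allocated;
    }

    // %Control(f) = (|Capture(f)| - sum over children c of |Capture(c)|) / |Alloc(f)|
    static double percentControl(long capturedByF, long[] capturedByChildren, long allocated) {
        long childSum = 0;
        for (long c : capturedByChildren) childSum += c;
        return allocated == 0 ? 0.0 : (double) (capturedByF - childSum) / allocated;
    }

    public static void main(String[] args) {
        System.out.println(percentCapture(4, 6));                 // 4/6 ≈ 0.667
        System.out.println(percentControl(4, new long[]{1}, 6));  // 3/6 = 0.5
    }
}
```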

How to Compute Analyses

- Goals:
  - Efficient runtime mechanism
  - Thread-safe
  - Simple to add to existing JIT code
- Solution: track the heap allocation pointer, run GC
  - Requires thread-local heaps (TLHs) & a copying collector
    - Supported by virtually all modern JVMs
  - Use runtime sampling to control overhead
- An alternative solution works for any JVM + GC
  - Details in the Appendix

Computing Analyses with TLHs

1. Choose to sample a function f
2. Track the thread-local heap allocation pointer through f's child calls
   (record tlhp at the start and end of f and of each child call c)
3. Run GC at the end of f
4. Compute capture and control:

   %Capture(f) = |Capture(f)| / |Alloc(f)|
   %Control(f) = (|Capture(f)| − Σc |Capture(c)|) / |Alloc(f)|

   The |Capture(c)| terms are computed from sampling runs on the children.
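The pointer-delta idea behind step 2 can be sketched as follows. This is hypothetical bookkeeping (a real JVM reads the TLH bump pointer directly, and Capture(f) comes from subtracting the bytes that survive the GC at f's return):

```java
public class TlhSample {
    // Simulated thread-local heap: allocation just bumps a pointer.
    private long tlhPointer = 0;

    long alloc(long size) { tlhPointer += size; return tlhPointer; }
    long pointer() { return tlhPointer; }

    public static void main(String[] args) {
        TlhSample heap = new TlhSample();
        long startF  = heap.pointer();  // tlhp(start(f))
        heap.alloc(32);                 // f allocates directly
        long startC1 = heap.pointer();  // tlhp(start(c1))
        heap.alloc(64);                 // child c1 allocates
        long endC1   = heap.pointer();  // tlhp(end(c1))
        long endF    = heap.pointer();  // tlhp(end(f))

        long allocF  = endF - startF;   // Alloc(f)  = 96 bytes
        long allocC1 = endC1 - startC1; // c1's share = 64 bytes
        System.out.println(allocF + " " + allocC1);
    }
}
```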

Step 2: Optimize: Smart Inlining

- The churn analyses identified subprogram roots
- Now, inline the subprogram to expose allocations to escape analysis
  - Respect the JIT's optimization constraints (size bound)
- We can do better than inlining the whole subprogram:
  - Only need to inline functions that add churned allocation sites to the root

Step 2: Optimize: Smart Inlining (continued)

- Goal: inline the descendants that expose the most churned allocations to escape analysis
  - While still respecting the size bound
- NP-hard problem! (Knapsack reduces to it)
- Which children should be inlined to get closest to the size bound without exceeding it?

Knapsack Approximation

- Simple polynomial-time approximation:
  - Inline the child with the greatest ratio of object allocations to code size
    - High %capture(f) → objects allocated in c are churned
  - Repeat until the size limit is reached
- But greedy is short-sighted!
  - A low-ratio child A may hide a high-ratio grandchild B: B will never be inlined because A will never be inlined

Churn Analyses to the Rescue

- Would like to inline a child if its subprogram has churn-elimination potential
- We already have an approximation: alloc(c)
  - Recall that alloc(c) counts the allocations in the entire subprogram rooted at c
- So: feed the Knapsack approximation alloc(c) instead of the number of local object allocations in c
  - In the example, A is inlined because its subprogram has high alloc; then B can be inlined
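The greedy ratio heuristic can be sketched as below. This is an illustrative reconstruction, not J9's inliner: the `Callee` summary, its names, sizes, and allocation counts are all hypothetical, and per the slide above, `allocs` should count allocations in the whole subprogram rooted at the callee:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GreedyInliner {
    // Hypothetical summary of an inlining candidate.
    static final class Callee {
        final String name; final int codeSize; final long allocs;
        Callee(String name, int codeSize, long allocs) {
            this.name = name; this.codeSize = codeSize; this.allocs = allocs;
        }
        double ratio() { return (double) allocs / codeSize; }
    }

    // Greedy knapsack approximation: repeatedly take the candidate with the
    // highest allocations-per-unit-of-code ratio that still fits the budget.
    static List<String> chooseInlines(List<Callee> candidates, int sizeBudget) {
        List<Callee> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble((Callee c) -> c.ratio()).reversed());
        List<String> chosen = new ArrayList<>();
        int used = 0;
        for (Callee c : sorted) {
            if (used + c.codeSize <= sizeBudget) {
                chosen.add(c.name);
                used += c.codeSize;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Callee> cands = List.of(
            new Callee("a", 100, 50),   // ratio 0.5
            new Callee("b", 40, 30),    // ratio 0.75
            new Callee("c", 200, 20));  // ratio 0.1
        System.out.println(chooseInlines(cands, 150)); // [b, a]
    }
}
```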

Eliminating Allocations

- Once the descendants have been inlined, pass the root to escape analysis
  - Uses the JIT's existing escape analysis
- Because of smart inlining, objects' allocation sites are in f and their lifetimes don't escape f
- Escape analysis eliminates allocations via stack allocation or scalar replacement
- Bonus: improvements in escape analysis mean a better Jolt
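To make the effect concrete, here is the earlier churn example hand-inlined for illustration: once bar() is inlined into foo(), the temporary String's whole lifetime is visible inside one frame, which is exactly the condition a JIT's escape analysis needs to stack-allocate or scalar-replace it:

```java
public class AfterInlining {
    // bar() inlined into foo(): the String no longer escapes this frame,
    // so escape analysis can eliminate the heap allocation.
    static int fooInlined() {
        String tmp = new String("foobar");
        return tmp.length();
    }

    public static void main(String[] args) {
        System.out.println(fooInlined()); // 6
    }
}
```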

Experimental Methodology

- Implemented Jolt in IBM's J9 JVM
  - Fully automatic, transparent
- Ran on large-scale benchmarks:
  - Eclipse
  - JPetStore on Spring
  - TPC-W on JBoss
  - SPECjbb2005
  - DaCapo

Results

    Program               Base %Objs Elim   Jolt %Objs Elim   Ratio       Speedup
    Eclipse               0.4%              1.9%              4.6x        4.8%
    JPetStore on Spring   0.7%              2.5%              3.5x        2.2%
    TPC-W on JBoss        0.0%              4.3%              1.7x10^3 x  2.6%
    SPECjbb2005           —                 32.1%             3.4x        13.2%
    DaCapo                3.4%              13.3%             3.9x        4.4%

- Escape analysis performs poorly on large component-based apps
- Median ratio: 4.3x as many objects removed by Jolt
  - Still many more to go
- Median speedup: 4.8%

Additional Experiments

- Runtime overhead is acceptable
  - Average compilation overhead: 32% longer
    - Acceptable for long-running programs (< 15 s)
    - Often outweighed by the speedup
  - Average profiling overhead: 1.0%
    - Run at 1 sample per 100k function invocations
- The combination of churn analyses and inlining performs better than either alone
  - In every case, Jolt outperformed the separate configurations

Summary

- Jolt reduces object churn
  - Transparently at runtime
  - In large applications
- Easy to implement
  - Heavily uses existing JIT technologies
- Two-part approach:
  - Dynamic churn analyses: capture and control
    - Pinpoint the roots of good subprograms
  - Smart inlining
    - Uses the analyses and a Knapsack approximation to inline beneficial functions into the root
- Thanks!

Insight: One Context, Lots o' Churn

- Objects escape their allocation context
- Traditional escape analysis: hands tied
  - Several escape analyses explore up the stack to add context [Blanchet 99, Whaley 99, Gay 00]
  - Many objects = many contexts
  - Hard to scale to 10k functions, 500M context-sensitive allocations!
- Jolt: one context, many objects

Finding the Perfect Contexts

- (Naïve) solution:
  - Inline everything into one mega-function
    - Churned objects die in their allocation context!
  - Run standard escape analysis
    - Removes churned object allocations
  - Simple, reuses JVM components
- Totally impractical!
  - Programs have 10k+ functions
  - Exponential inlining, JIT bound

Houston, We Have a Solution

Jolt has a two-part solution:
1. Find the root functions of subprograms with lots of churn
2. Inline only the portions of those subprograms that allocate churn
- Then run escape analysis as usual
  - Churned objects no longer escape their context
- Retains the appeal of the naïve solution
  - Simple
  - Reuses JVM components