Garbage Collection Advantage: Improving Program Locality

Garbage Collection Advantage: Improving Program Locality
Xianglong Huang (UT), Stephen M Blackburn (ANU), Kathryn S McKinley (UT), J Eliot B Moss (UMass), Zhenlin Wang (MTU), Perry Cheng (IBM)
Special thanks to the authors for providing many of these slides

Motivation
- Memory gap problem
- OO programs become more popular
- OO programs exacerbate the memory gap problem
  - Automatic memory management
  - Pointer data structures
  - Many small methods
- Goal: improve OO program locality

Cache Performance Matters

Opportunity
- A generational copying garbage collector reorders objects at runtime

Copying of Linked Objects
[Figure: a linked graph of objects numbered 1 to 7, with the order in which each traversal copies them. Panels: Breadth First (1 2 3 4 5 6 7), Depth First (1 2 3 5 4 6 7), Online Object Reordering.]
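How the traversal choice changes the copy order can be sketched in a few lines of Java. This is only an illustration: the `Node` class, the `sampleGraph` tree (1 points to 2, 3, 4, which point to 5, 6, 7) and the method names are assumptions made for the example, not the graph on the slide and not Jikes RVM code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CopyOrderDemo {
    /** Hypothetical heap object: an id plus outgoing references. */
    static final class Node {
        final int id;
        final List<Node> refs = new ArrayList<>();
        Node(int id) { this.id = id; }
    }

    /** Breadth-first copy order: objects are copied when first reached and scanned FIFO. */
    static List<Integer> breadthFirst(Node root) {
        List<Integer> order = new ArrayList<>();
        Set<Node> copied = new HashSet<>();
        Deque<Node> toScan = new ArrayDeque<>();
        copied.add(root);
        order.add(root.id);
        toScan.add(root);
        while (!toScan.isEmpty()) {
            Node scanned = toScan.poll();
            for (Node child : scanned.refs) {
                if (copied.add(child)) {      // true only the first time we reach child
                    order.add(child.id);
                    toScan.add(child);
                }
            }
        }
        return order;
    }

    /** Depth-first copy order: each object is followed by everything reachable from its first reference. */
    static List<Integer> depthFirst(Node root) {
        List<Integer> order = new ArrayList<>();
        visit(root, new HashSet<>(), order);
        return order;
    }

    private static void visit(Node node, Set<Node> copied, List<Integer> order) {
        if (!copied.add(node)) return;        // already copied
        order.add(node.id);
        for (Node child : node.refs) visit(child, copied, order);
    }

    /** Assumed example graph: 1 -> {2, 3, 4}, 2 -> 5, 3 -> 6, 4 -> 7. */
    static Node sampleGraph() {
        Node[] n = new Node[8];
        for (int i = 1; i <= 7; i++) n[i] = new Node(i);
        n[1].refs.add(n[2]); n[1].refs.add(n[3]); n[1].refs.add(n[4]);
        n[2].refs.add(n[5]); n[3].refs.add(n[6]); n[4].refs.add(n[7]);
        return n[1];
    }

    public static void main(String[] args) {
        System.out.println("breadth-first: " + breadthFirst(sampleGraph()));
        System.out.println("depth-first:   " + depthFirst(sampleGraph()));
    }
}
```

On this assumed tree, breadth-first places the three children of object 1 next to it, while depth-first places each child next to its own descendant. Neither order looks at which fields the program actually uses.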

Outline
- Motivation
- Online Object Reordering (OOR)
- Methodology
- Experimental Results
- Conclusion

OOR System Overview
- Records object accesses in each method (excludes cold basic blocks)
- Finds hot methods by adaptive sampling
- Reorders objects with hot fields in older generation during GC
- Copies hot objects into separate region

Online Object Reordering
- Where are the cache misses?
- How to identify hot field accesses at runtime?
- How to reorder the objects?

Where Are The Cache Misses?
[Figure: heap structure, not to scale. Regions: VM Objects, Stack, Older Generation, Nursery.]

Where Are The Cache Misses?

Where Are The Cache Misses?
- Two opportunities to reorder objects in the older generation
  - Promote nursery objects
  - Full heap collection

How to Find Hot Fields?
- Runtime info (intercept every read)?
- Compiler analysis?
- Runtime information + compiler analysis
- Key: Low overhead estimation

Which Classes Need Reordering?
- Step 1: Compiler analysis
  - Excludes cold basic blocks
  - Identifies field accesses
- Step 2: JIT adaptive sampling identifies hot methods
  - Mark as hot field accesses in hot methods
- Key: Low overhead estimation

Example: Compiler Analysis
    Method Foo {
      Class A a;
      try {
        …=a.b; …
      }
      catch(Exception e){
        …a.c
- Hot BB (the try block): the compiler collects access info
- Cold BB (the catch block): the compiler ignores it
- Access List: 1. A.b 2. …. ….

Example: Adaptive Sampling
    Method Foo {
      Class A a;
      try {
        …=a.b; …
      }
      catch(Exception e){
        …a.c
- Adaptive sampling: Foo is hot
- Foo Accesses: 1. A.b 2. …. ….
- A.b is hot
[Figure: A's type information (fields c, b, …) next to an object of class A with fields c and b, and an object B.]
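The two steps in these examples can be sketched as a small bookkeeping class: the compiler records field accesses outside cold basic blocks, and sampling later promotes the recorded fields of a method once it becomes hot. Everything here is an assumption made for illustration (the `HotFieldRegistry` name, string keys for methods and fields, a fixed sample-count threshold); it is not the Jikes RVM implementation.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class HotFieldRegistry {
    private final int hotThreshold;                                   // samples needed before a method counts as hot (assumed policy)
    private final Map<String, Set<String>> accessesByMethod = new HashMap<>();
    private final Map<String, Integer> samples = new HashMap<>();
    private final Set<String> hotFields = new LinkedHashSet<>();

    HotFieldRegistry(int hotThreshold) {
        this.hotThreshold = hotThreshold;
    }

    /** Compile time: remember a field access unless it sits in a cold basic block. */
    void recordAccess(String method, String field, boolean inColdBlock) {
        if (inColdBlock) return;
        accessesByMethod.computeIfAbsent(method, m -> new LinkedHashSet<>()).add(field);
    }

    /** Run time: one sample landed in this method; once it is hot, its recorded fields become hot. */
    void sample(String method) {
        int count = samples.merge(method, 1, Integer::sum);
        if (count >= hotThreshold) {
            hotFields.addAll(accessesByMethod.getOrDefault(method, Set.of()));
        }
    }

    boolean isHot(String field) {
        return hotFields.contains(field);
    }

    public static void main(String[] args) {
        HotFieldRegistry registry = new HotFieldRegistry(3);
        registry.recordAccess("Foo", "A.b", false);   // access in the try block (hot BB)
        registry.recordAccess("Foo", "A.c", true);    // access in the catch block (cold BB), ignored
        for (int i = 0; i < 3; i++) registry.sample("Foo");
        System.out.println("A.b hot? " + registry.isHot("A.b"));
        System.out.println("A.c hot? " + registry.isHot("A.c"));
    }
}
```

The point of the split is cost: the per-access work happens once at compile time, and the run-time part only counts samples per method, which matches the slides' emphasis on low overhead estimation.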

Copying of Linked Objects
[Figure: the object graph (objects 1 to 7) copied with Online Object Reordering. Labels: Type Information (1, 3, 4), Hot space, Cold space.]
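One way to picture a copy that is guided by hot fields is the sketch below. It is a simplified model under stated assumptions, not the authors' collector: objects reached through a hot field are copied immediately into a "hot space" list so they sit next to each other, while referents of other fields are deferred and end up in a "cold space" list. The `Obj` class, the field-name keys such as `"A.b"`, and the `sampleHeap` graph are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class HotFirstCopy {
    /** Hypothetical heap object: an id plus named reference fields. */
    static final class Obj {
        final int id;
        final Map<String, Obj> fields = new LinkedHashMap<>();
        Obj(int id) { this.id = id; }
    }

    /** Copies everything reachable from root, following hot fields before any other field. */
    static void copy(Obj root, Set<String> hotFieldNames,
                     List<Integer> hotSpace, List<Integer> coldSpace) {
        Set<Obj> copied = new HashSet<>();
        Deque<Obj> toScan = new ArrayDeque<>();       // copied objects whose fields still need scanning
        Deque<Obj> coldPending = new ArrayDeque<>();  // referents of non-hot fields, not copied yet
        coldPending.add(root);
        while (!coldPending.isEmpty()) {
            Obj next = coldPending.poll();
            if (!copied.add(next)) continue;          // already copied through a hot field
            coldSpace.add(next.id);
            toScan.add(next);
            while (!toScan.isEmpty()) {
                Obj scanned = toScan.poll();
                for (Map.Entry<String, Obj> field : scanned.fields.entrySet()) {
                    Obj target = field.getValue();
                    if (target == null) continue;
                    if (hotFieldNames.contains(field.getKey())) {
                        if (copied.add(target)) {     // hot referent: copy right away
                            hotSpace.add(target.id);
                            toScan.add(target);
                        }
                    } else {
                        coldPending.add(target);      // other referent: defer
                    }
                }
            }
        }
    }

    /** Assumed example heap: 1 -(A.c)-> 2 -(A.c)-> 5 and 1 -(A.b)-> 3 -(A.b)-> 4. */
    static Obj sampleHeap() {
        Obj o1 = new Obj(1), o2 = new Obj(2), o3 = new Obj(3), o4 = new Obj(4), o5 = new Obj(5);
        o1.fields.put("A.c", o2);
        o1.fields.put("A.b", o3);
        o3.fields.put("A.b", o4);
        o2.fields.put("A.c", o5);
        return o1;
    }

    public static void main(String[] args) {
        List<Integer> hot = new ArrayList<>();
        List<Integer> cold = new ArrayList<>();
        copy(sampleHeap(), Set.of("A.b"), hot, cold);
        System.out.println("hot space:  " + hot);
        System.out.println("cold space: " + cold);
    }
}
```

With `A.b` marked hot, the chain reached through `A.b` is copied contiguously before anything reached through `A.c`. Treating the root as cold is an arbitrary choice made for this sketch.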

OOR System Overview
[Figure: block diagram of the system. Components: Source Code, Baseline Compiler, Optimizing Compiler, Adaptive Sampling, Hot Methods, Access Info Database, GC (Copies Objects), Executing Code. Edge labels: Adds Entries, Look Up, Register Hot Field Accesses, Affects Locality, Improves Locality. Legend: Advice, Input/Output, JikesRVM component, OOR addition.]

Outline
- Motivation
- Online Object Reordering
- Methodology
- Experimental Results
- Conclusion

Methodology: Virtual Machine
- Jikes RVM
  - VM written in Java
  - High performance
  - Timer based adaptive sampling
  - Dynamic optimization
- Experiment setup
  - Pseudo-adaptive
  - 2nd iteration [Eeckhout et al.]

Methodology: Memory Management
- Memory Management Toolkit (MMTk)
  - Allocators and garbage collectors
  - Multi-space heap
    - Boot image
    - Large object space (LOS)
    - Immortal space
- Experiment setup
  - Generational copying GC with 4M bounded nursery
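As a rough picture of what a bounded nursery means for the experiment setup, the sketch below allocates into a nursery with a fixed size limit and runs a minor collection whenever the next allocation would exceed it, promoting surviving objects to the older generation. This is a toy model under assumptions: the class and method names are invented, reachability is a precomputed flag instead of a real trace, and it does not reflect MMTk's actual interfaces.

```java
import java.util.ArrayList;
import java.util.List;

public class BoundedNurseryDemo {
    /** Hypothetical object: a size and a flag standing in for "still reachable at collection time". */
    static final class Obj {
        final int size;
        final boolean reachable;
        Obj(int size, boolean reachable) { this.size = size; this.reachable = reachable; }
    }

    private final int nurseryLimit;                    // nursery bound, in the same unit as object sizes
    private int nurseryUsed = 0;
    private int minorCollections = 0;
    private final List<Obj> nursery = new ArrayList<>();
    private final List<Obj> olderGeneration = new ArrayList<>();

    BoundedNurseryDemo(int nurseryLimit) {
        this.nurseryLimit = nurseryLimit;
    }

    /** Allocates in the nursery, collecting it first if the request would exceed the bound. */
    Obj allocate(int size, boolean reachable) {
        if (nurseryUsed + size > nurseryLimit) minorCollect();
        Obj obj = new Obj(size, reachable);
        nursery.add(obj);
        nurseryUsed += size;
        return obj;
    }

    /** Minor collection: survivors are copied (promoted) to the older generation, the rest is dropped. */
    void minorCollect() {
        for (Obj obj : nursery) {
            if (obj.reachable) olderGeneration.add(obj);   // the copy order chosen here shapes locality
        }
        nursery.clear();
        nurseryUsed = 0;
        minorCollections++;
    }

    int promotedCount() { return olderGeneration.size(); }
    int minorCollections() { return minorCollections; }
    int nurseryUsed() { return nurseryUsed; }

    public static void main(String[] args) {
        int mib = 1024 * 1024;
        BoundedNurseryDemo heap = new BoundedNurseryDemo(4 * mib);   // "4M" read as 4 MiB, an assumption
        for (int i = 0; i < 5; i++) heap.allocate(mib, i % 2 == 0);
        System.out.println("minor collections: " + heap.minorCollections());
        System.out.println("promoted objects:  " + heap.promotedCount());
    }
}
```

The promotion loop is the hook the earlier slides point at: every nursery collection copies survivors into the older generation, so it is a natural place to choose a better object order.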

Overhead: OOR Analysis Only
Benchmark   Base execution time (sec)   With only OOR analysis (sec)   Overhead
jess        4.39                        4.43                            0.84%
jack        5.79                        5.82                            0.57%
raytrace    4.63                        4.61                           -0.59%
mtrt        4.95                        4.99                            0.70%
javac       12.83                       12.70                          -1.05%
compress    8.56                        8.54                            0.20%
pseudojbb   13.39                       13.43                           0.36%
db          18.88                       [missing]                      -0.03%
antlr       0.94                        0.91                           -2.90%
hsqldb      160.56                      158.46                         -1.30%
ipsixql     41.62                       42.43                           1.93%
jython      37.71                       37.16                          -1.44%
ps-fun      129.24                      128.04                         -1.03%
Mean                                                                   -0.19%

Detailed Experiments
- Separate application and GC time
- Vary thresholds for method heat
- Vary thresholds for cold basic blocks
- Three architectures: x86, AMD, PowerPC
- x86 performance counters: DL1, trace cache, L2, DTLB, ITLB

Discussion
- What will be the result of cache affinity on multicore systems?
- Will memory affinity benefit the system more?
- Do we need some modification to the algorithm to increase cache affinity per core?
- Will it significantly improve performance?

Performance Implications of Cache Affinity on Multicore Processors Vahid Kazempour, Alexandra Fedorova and Pouya Alagheband EuroPar 2008 “We hypothesized that cache affinity does not affect performance on multicore processors: on multicore uniprocessors — because reloading the L1 cache state is cheap, and on multicore multiprocessors – because L2 cache affinity is generally low due to cache sharing.” “Even though upper-bound performance improvements from exploiting cache affinity on multicore multiprocessors are lower than on unicore multiprocessors, they are still significant: 11% on average and 27% maximum. This merits consideration of affinity awareness on multicore multiprocessors.”

Performance javac

Performance db

Performance jython
- Any static ordering leaves you vulnerable to pathological cases.

Phase Changes

Conclusion
- Static traversal orders have up to 25% variation
- OOR improves or matches best static ordering
- OOR has very low overhead
- Past predicts future

Discussion
- In the experiments, OOR gives a performance benefit of 10-40% compared to a full-heap mark-sweep collector.
- Is the comparison valid? OOR runs with a generational copying collector, whereas mark-sweep collects the full heap at a time.
- How much of the benefit comes from the generational version of the copying collector?
- In the Myths and Realities paper, GenMS is 30-45% better than full-heap MS.

Questions? Thank you!