OOPSLA 2003 Mostly Concurrent Garbage Collection Revisited Katherine Barabash - IBM Haifa Research Lab. Israel Yoav Ossia - IBM Haifa Research Lab. Israel.

Presentation transcript:

OOPSLA 2003 Mostly Concurrent Garbage Collection Revisited Katherine Barabash - IBM Haifa Research Lab. Israel Yoav Ossia - IBM Haifa Research Lab. Israel Erez Petrank - Technion. Israel

IBM Labs in Haifa OOPSLA

Outline
- The mostly concurrent garbage collection (GC)
- Internals of the collector
  - Write barrier and card table
  - Incremental collection
- Our two improvements
  - And their implications on performance
- Results
- Conclusions

Mark Sweep Stop-The-World (STW) Garbage Collection
- The basic method
  - Mark all objects that are reachable from roots
  - Sweep - reclaim all unmarked objects
  - Done while Java mutation is suspended (STW)
  - Pause time - the length of the STW phase
- Motivation for the mostly concurrent GC
  - Reduce the pause time at an acceptable throughput hit
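A minimal Python sketch of the basic mark-sweep method, assuming a toy heap modeled as a dict from object name to the names it references. The function name `mark_sweep` and the heap model are illustrative and not taken from the talk; a real collector works on raw memory with mark bits.

```python
def mark_sweep(heap, roots):
    """heap: dict mapping object name -> list of referenced names (toy model)."""
    marked = set()
    stack = list(roots)
    # Mark: visit everything reachable from the roots.
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue
        marked.add(obj)
        stack.extend(heap[obj])
    # Sweep: reclaim every unmarked object.
    for obj in list(heap):
        if obj not in marked:
            del heap[obj]
    return marked


# "c" and "d" reference each other but are unreachable from the root "a".
heap = {"a": ["b"], "b": [], "c": ["d"], "d": ["c"]}
live = mark_sweep(heap, roots=["a"])
```

In a stop-the-world collector the whole call runs while mutators are suspended, so its duration is the pause time.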

Mostly Concurrent GC - The Basic Method
- Perform marking concurrently with Java mutation
  - Traditionally done by a separate thread
- While concurrent marking is active, record changes in objects
  - Otherwise…
- When marking terminates, do a short STW phase
  - Re-trace from
    - Roots
    - Marked objects that were not traced yet
    - Marked and changed objects
  - Sweep

Mostly Concurrent GC - Perspective
- Related Work
  - Steele and Dijkstra et al. - Concurrent GC
  - Baker - Incremental collection
  - Boehm et al. - Mostly concurrent collection
  - Printezis and Detlefs - Mostly concurrent and generational GC
  - Many others…
  - Ossia et al. - Parallel, incremental and concurrent GC
- Status
  - Collector well accepted both in academic research and industry
  - Used in many production JVMs: IBM, Sun, BEA JRockit


The Write Barrier and Object “Cleaning” Interaction
[Figure: object graph walkthrough; step captions listed below]
- Tracer: marks and traces
- Java mutator: modifies Blue and Green objects
- Write barrier on objects
- Tracer: traces rest of graph
- Tracer: cleans blue object

Mostly Concurrent GC - Card Cleaning
- Heap is logically divided into cards
- A card table is used, with one byte entry per heap card
- A card-marking write barrier
  - Whenever a reference field is modified, dirty the card table entry of the modified object
- Card cleaning
  - Clean dirty mark
  - Retrace all marked objects on card
- Card cleaning can be done concurrently
  - While this is done, more cards will be dirtied
  - An additional STW card cleaning phase must be done
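A small Python sketch of a card table with a card-marking write barrier and one cleaning pass. The card size of 512 bytes, the class name `CardTable`, and the use of integer addresses for objects are assumptions made for illustration; the talk does not give these details.

```python
CARD_SIZE = 512  # bytes per card; an assumed value, not from the talk


class CardTable:
    def __init__(self, heap_bytes):
        n_cards = (heap_bytes + CARD_SIZE - 1) // CARD_SIZE
        self.dirty = bytearray(n_cards)  # one byte entry per heap card

    def write_barrier(self, obj_addr):
        # Called whenever a reference field of the object at obj_addr is modified.
        self.dirty[obj_addr // CARD_SIZE] = 1

    def clean_pass(self, marked_addrs, retrace):
        """Clean every dirty card, then retrace the marked objects on it."""
        cleaned = 0
        for card, flag in enumerate(self.dirty):
            if not flag:
                continue
            self.dirty[card] = 0  # clean the dirty mark first
            for addr in marked_addrs:
                if addr // CARD_SIZE == card:
                    retrace(addr)
            cleaned += 1
        return cleaned


table = CardTable(heap_bytes=4096)
table.write_barrier(1030)  # address 1030 lies on card 2
retraced = []
n = table.clean_pass(marked_addrs=[8, 1100, 1500, 2048], retrace=retraced.append)
```

Only the two marked objects on card 2 (addresses 1100 and 1500) are retraced. Clearing the dirty mark before retracing means a concurrent write during the retrace dirties the card again, which is why a final STW cleaning pass is still needed.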

Incremental Mostly Concurrent GC
- Marking done by the Java mutator threads
  - When allocating
- Tracing Rate (TR) - configurable by user
  - The ratio between required tracing work and requested allocation size
  - For every allocation request of K bytes, trace K * TR bytes of objects
  - Allocation rate of application and tracing rate of collector imply the CPU percentage dedicated to the concurrent collection
- Starting the concurrent collector
  - Must be done on time, so that tracing completes by the time the heap is exhausted
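A Python sketch of allocation-driven incremental tracing: each allocation of K bytes pays for K * TR bytes of tracing. The class name `IncrementalTracer` and the work list of object sizes are hypothetical; the sketch only illustrates the budget rule stated above.

```python
class IncrementalTracer:
    def __init__(self, tracing_rate, work_list):
        self.tracing_rate = tracing_rate
        self.work = list(work_list)  # sizes (bytes) of objects still to trace
        self.traced_bytes = 0

    def on_allocate(self, k_bytes):
        """Trace roughly k_bytes * tracing_rate bytes on behalf of this allocation."""
        budget = k_bytes * self.tracing_rate
        done = 0
        # Whole objects are traced, so the last one may overshoot the budget.
        while self.work and done < budget:
            done += self.work.pop()
        self.traced_bytes += done
        return done


tracer = IncrementalTracer(tracing_rate=8, work_list=[64] * 10)
done = tracer.on_allocate(16)  # budget is 16 * 8 = 128 bytes

slow = IncrementalTracer(tracing_rate=1, work_list=[64] * 10)
slow_done = slow.on_allocate(16)  # budget is 16 bytes; one 64-byte object is traced
```

A higher tracing rate finishes the concurrent cycle sooner but leaves less CPU for the mutator during that cycle.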

Concurrent Behavior Patterns
- Higher tracing rate implies shorter concurrent cycle with smaller CPU share for Java
- Numbers below refer to SPECjbb
- STW GC
  - 100% CPU for Java mutation
- Mostly Concurrent, Tracing Rate 8
  - 28% CPU for Java mutation
- Mostly Concurrent, Tracing Rate 1
  - 72% CPU for Java mutation
[Figure: CPU utilization over time; legend: Java mutation, incremental tracing, parallel STW]

Mostly Concurrent GC - Summary of the Base Algorithm
- Fast card-marking write barrier always active (JITed)
- Kick off concurrent tracing when free space reaches the kickoff point
  - Reset the card table
  - Trace (incrementally) all objects reachable from roots
  - Do a single concurrent card cleaning pass on the card table
- Initiate final (short) STW phase
  - Trace the roots again for new objects
  - Do another card cleaning pass
  - Trace all newly marked objects
  - Sweep


The Repetitive Work Problem
- Observations
  - Suppose a card is dirtied while concurrent marking is executing
  - Newly reached objects in this card are marked and traced by the collector
  - All these objects will later be traced again in the card cleaning phase
- Outcome: repeated tracing
- Improvement: Don’t trace through dirty cards

Don’t Trace Through Dirty Cards
- If an object resides in a dirty card, omit its tracing
  - Only a single tracing will be done, in the concurrent card cleaning phase
- Advantages
  - Less marking work
  - Reduced floating garbage (more later…)
  - Reduced cache miss rate (more later…)
  - Thus, substantial throughput improvements
- Disadvantage
  - Increased pause time
[Figure: timelines for the base method and “Don’t trace dirties”; phases: Java mutation, concurrent tracing, STW tracing, concurrent card cleaning, STW card cleaning]
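A Python sketch comparing the base method with "don't trace through dirty cards" on a toy object graph, counting how often each object is scanned. The helper names `scan_from` and `collect`, the dict-based graph, and the simplification that the card is already dirty when the tracer reaches it are all assumptions for illustration.

```python
from collections import Counter


def scan_from(stack, refs, card_of, dirty, skip_dirty, marked, scans):
    while stack:
        obj = stack.pop()
        if skip_dirty and card_of[obj] in dirty:
            continue  # stays marked but unscanned; card cleaning retraces it
        scans[obj] += 1
        for child in refs[obj]:
            if child not in marked:
                marked.add(child)
                stack.append(child)


def collect(roots, refs, card_of, dirty_cards, skip_dirty):
    marked, scans = set(roots), Counter()
    dirty = set(dirty_cards)
    # Concurrent tracing from the roots.
    scan_from(list(roots), refs, card_of, dirty, skip_dirty, marked, scans)
    # Card cleaning: clear dirty marks, then retrace marked objects on those cards.
    pending = [o for o in marked if card_of[o] in dirty]
    dirty.clear()
    scan_from(pending, refs, card_of, dirty, skip_dirty, marked, scans)
    return marked, scans


refs = {"a": ["b"], "b": ["c"], "c": []}
card_of = {"a": 0, "b": 1, "c": 2}
base_marked, base_scans = collect(["a"], refs, card_of, {1}, skip_dirty=False)
improved_marked, improved_scans = collect(["a"], refs, card_of, {1}, skip_dirty=True)
```

Both runs mark the same objects, but the base method scans `b` twice (once while tracing, once while cleaning), whereas the improved method scans it only during card cleaning.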

Timing of Card Dirtying
- Observations
  - Card (and object) dirtying indicates that previous tracing may have been insufficient
    - New objects may be reachable only from the dirtied object
  - Dirtying information is needed only if tracing was already done
    - Prior to tracing, card dirtying is irrelevant
- Improvement: Undirty cards with no traced objects
  - Undirtying via scanning
  - Undirtying via local allocation caches

Undirtying via Scanning
- Undirtying can be done periodically on the whole card table
- An indication of “traced” cards is needed
  - Mark bit vector
  - A “traced” card table (first traced object marks the card as traced)
- Method: scan the card table and undirty all cards with no traced objects
- Very effective in undirtying cards (cuts cleaning by 65%)
- Some extra cost of card table scan
  - Should be done frequently
  - Catch these cards before any marking or tracing occurs!
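A Python sketch of the scanning method, assuming the "traced" card table variant: one byte per card in each of two parallel tables. The function name `undirty_scan` is illustrative.

```python
def undirty_scan(dirty, traced):
    """Undirty every dirty card that holds no traced object.

    dirty, traced: bytearrays with one entry per card.
    Returns the number of cards undirtied.
    """
    undirtied = 0
    for card in range(len(dirty)):
        if dirty[card] and not traced[card]:
            dirty[card] = 0
            undirtied += 1
    return undirtied


dirty = bytearray([1, 1, 0, 1])
traced = bytearray([0, 1, 0, 0])
n = undirty_scan(dirty, traced)
```

Cards 0 and 3 are dirty but untraced, so their dirty marks carry no information and are cleared; card 1 stays dirty because an object on it was already traced.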

Undirtying via Local Allocation Caches
- Local allocation caches are used by most modern JVMs
- Most cards of active caches are dirty
  - Objects usually have write barriers while (and shortly after) initializing
- Objects in active cache are (usually) not traced before the cache is replaced
- If no tracing in the active cache is guaranteed, we can undirty its cards
- Method: cooperation between allocators and concurrent tracers
  - Allocator (when replacing a local cache):
    - Undirty all the cache’s inner cards
    - Mark all cards as “traceable”
    - Take a new cache and mark all its cards as “untraceable”
  - Concurrent tracer
    - Defer tracing of objects in “untraceable” cards “for a while”
    - BTW, this hardly ever happens
- Cuts the amount of dirty cards by more than 35%, at no cost
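A Python sketch of the allocator/tracer cooperation. The names `CardState`, `take_cache`, `retire_cache`, and `may_trace` are hypothetical, and treating "inner cards" as all cards of the cache except its first and last (which may be shared with neighboring memory) is an assumption, not something the talk spells out.

```python
class CardState:
    def __init__(self, n_cards):
        self.dirty = [False] * n_cards
        self.untraceable = [False] * n_cards


def take_cache(state, cards):
    # Allocator takes a new local cache: its cards must not be traced yet.
    for c in cards:
        state.untraceable[c] = True


def retire_cache(state, cards):
    # Allocator replaces its local cache.
    for c in cards[1:-1]:  # inner cards only (assumed: boundary cards are excluded)
        state.dirty[c] = False
    for c in cards:
        state.untraceable[c] = False  # cards become "traceable" again


def may_trace(state, card):
    # Concurrent tracer defers objects on "untraceable" cards.
    return not state.untraceable[card]


state = CardState(n_cards=8)
cache = [2, 3, 4, 5]
take_cache(state, cache)
state.dirty[2] = state.dirty[3] = True  # write barriers during initialization
deferred = not may_trace(state, 3)
retire_cache(state, cache)
```

After the cache is retired, inner card 3 is clean and traceable, while boundary card 2 keeps its dirty mark.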

Undirty Cards with No Traced Objects
- Advantages
  - Without “Don’t trace through dirty cards”: less work
  - With it, reduces the STW card cleaning significantly
[Figure: timelines for the base method, “Don’t trace dirties”, and “Don’t trace dirties + Undirty”; phases: Java mutation, concurrent tracing, STW tracing, concurrent card cleaning, STW card cleaning]

Characteristics of Dirty Cards
- We believe that a recently dirtied card is a good indication of more modifications to its objects in the near future
  - Change of references
  - Other writing activities
  - The indication applies to all the objects in the card
- A recently dirtied card is probably hot and active
- But we don’t trace through dirty cards!
  - By the time we get to clean them, they will probably have become more stable and colder

Reduced Floating Garbage
- Floating garbage is created when tracing is done before the object modification
- Don’t trace through dirty cards!
  - Will probably defer the tracing until the card gets stable
  - Objects are no longer modified
  - No floating garbage will be created as a result of this late tracing

Reduced Cache Miss Rate
- Reducing the tracing work affects the cache miss rate
  - Tracing the object graph intensifies cache capacity misses
- But cache coherency misses are also reduced
  - A write-barriered card (hot and active) is probably being modified by Java mutators
  - If a concurrent tracer scans objects on such a card, it will suffer coherency misses
- Don’t trace through dirty cards!
  - Deferring the tracing of these objects to the card cleaning phase reduces cache coherency misses
- Our improved collector reduces L2 cache miss rate by 6.4%
  - Out of which 3.7% is reduction in cache coherency misses


Implementation and Tests
- Implementation
  - On top of the mostly concurrent collector that is part of the IBM production JVM
- Platforms
  - Tested on both an IBM 6-way pSeries server and an IBM 4-way Netfinity server
- Benchmarks
  - The SPECjbb2000 benchmark and the SPECjvm98 benchmark suite
- Measurements
  - Performance of the base collector vs. the improved version
  - The effect of each improvement separately, and more…

Results - Throughput Improvement
- SPECjbb. 6-way PPC. Heap size 448 MB
- 26.7% improvement (at tracing rate 1)

Results - Floating Garbage Reduction
- SPECjbb. 6-way PPC. Heap size 448 MB
- 13.4% improvement in heap residency (at tracing rate 1)
- Almost all floating garbage eliminated

Results - Pause Time Reduction
- SPECjbb. 6-way PPC. Heap size 448 MB
- 33.3% improvement in average pause
- 36.4% improvement in max pause

Conclusions
- Introducing two improvements to the mostly concurrent GC
  - Reduces repetitive GC work (don’t trace through dirty cards)
  - Reduces number of dirty cards (undirty cards with no traced objects)
- Substantial improvement of the mostly concurrent GC
  - Improved throughput by 26%
  - Almost eliminated floating garbage (heap residency reduced by 13%)
  - Reduced average pause time by 33%
- Additional effects of not tracing into dirty cards
  - Reduced floating garbage
  - Reduced cache miss rate
- The improved algorithm has been incorporated into IBM's production JVM

End

Analyzing the Performance of Lower Tracing Rate
- Throughput hit rate
  - Relative to MS STW GC
- Java utilization
  - Relative to MS STW GC
- Live Rate
  - Relative to heap size
- Floating Garbage
  - Marked objects that become unreachable before the STW phase
  - More objects to trace. Less free space (more GCs)
- Card cleaning rate
  - Relative to total number of cards
  - More work. Longer final STW phase



Results - Average Pause Time
- SPECjbb. 6-way PPC. Heap size 448 MB

Throughput Improvement for All Tracing Rates

Heap Residency Reduction for All Tracing Rates


Reduced Floating Garbage
- Potential floating garbage root: a reachable object that drops its reference to a sub-graph and thus makes that sub-graph unreachable
- To become a floating garbage root, it must first be traced and then have a write barrier
- We believe that a freshly dirtied card is a good indication of more write barriers to come
- Deferring the tracing into a dirty card will defer the tracing until after the write barriers