1
Parallel, Incremental, and Mostly Concurrent GC
Yoav Ossia
IBM Haifa Research Laboratory, November 2002
2
Motivation
Modern SMP servers introduce:
- Higher levels of true parallelism
- Multi-gigabyte heaps
- Multi-threaded applications which must ensure fast response times

New demands on GC:
- Short pause times on large heaps
- Minimal throughput hit
- Scalability on multiprocessor hardware
- Efficient algorithms for weakly ordered hardware (we will not talk about this...)

Workarounds, which do not work:
- Bigger heaps
- Object pooling
3
Outline
- Introduction
- Principles of concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Coding issues
- Results (highlights)
- Recent work
4
Goals of this lecture
- Present the ideas and algorithms, as done in the IBM JVM
- Introduce the concerns of implementation
  - The devil's in the details
  - What's cheap and what's expensive, and how this gets into the design
- Concurrent coding, the real story
  - Difficulties, and how to avoid them
5
Mark Sweep Compact GC (MSC)
Mark - traces all reachable (live) objects in the heap
- Needed data structures and operations:
  - Mark stack (push & pop of references to objects)
  - SetMarked (Object *obj) & boolean IsMarked (Object *obj)
- Start from all roots (threads' stacks, system classes, etc.): mark, and push to the mark stack
- Process the mark stack until empty: pop Obj, and trace (mark and push all unmarked references from Obj); see the sketch below

Sweep
- Coalesce unmarked objects into free chunks
- Create a list of free chunks

Compact (the disaster zone)
- Usually done when an allocation request cannot be satisfied
- Move and coalesce the live objects, to create bigger free chunks
- Usually very long, unattractive
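As a concrete illustration, here is a minimal sketch of the mark phase in C. The helper names (push, pop, references_of, and so on) are assumptions for the example, not the IBM JVM's actual interfaces:

```c
#include <stddef.h>

typedef struct Object Object;

/* Assumed mark-stack and heap helpers. */
extern void     push(Object *obj);
extern Object  *pop(void);
extern int      stack_empty(void);
extern void     set_marked(Object *obj);
extern int      is_marked(Object *obj);
extern Object **references_of(Object *obj, size_t *n);  /* reference fields of obj */

/* Mark all objects reachable from the roots. */
void mark_phase(Object **roots, size_t nroots) {
    for (size_t i = 0; i < nroots; i++) {
        if (roots[i] && !is_marked(roots[i])) {
            set_marked(roots[i]);
            push(roots[i]);
        }
    }
    while (!stack_empty()) {            /* process until the stack drains */
        size_t n;
        Object *obj   = pop();
        Object **refs = references_of(obj, &n);
        for (size_t j = 0; j < n; j++) {
            if (refs[j] && !is_marked(refs[j])) {
                set_marked(refs[j]);    /* mark before pushing, so each */
                push(refs[j]);          /* object enters the stack once */
            }
        }
    }
}
```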
6
The MSC Process
[Figure: roots (stacks, globals) feed the Mark phase, followed by Sweep and, if needed, Compact. Legend: Used, Marked, Free, Dark matter, Unmovable.]
7
MSC Subtle Issues
Mark
- Proportional to the amount of live objects only
- Risk of mark stack overflow (e.g., a linked list with Next as the last field)
- Avoid paging and cache misses

Sweep
- Walks the entire heap again
- The naive method is proportional to the amount of live AND dead objects
- Partially solved by a mark bit-vector (sketched below):
  - A bit for each basic heap unit (typically 8 bytes); each object is mapped to a specific bit
  - Walk the mark bit-vector, and inspect the heap only when big holes are found in the vector

Compact
- Tradeoff: level of compaction vs. speed
- If the GC is not type-accurate, not all objects can be moved: we can't tell whether a slot on the stack is a reference or a numeric value
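A sketch of the mark bit-vector mapping described above, with one bit per 8-byte heap unit (heap_base and the 32-bit word layout are assumptions for the example):

```c
#include <stdint.h>

#define UNIT_SHIFT 3                      /* one bit per 8-byte heap unit */

static uintptr_t heap_base;               /* start address of the heap */
static uint32_t *mark_bits;               /* the mark bit-vector */

static inline uintptr_t unit_index(void *obj) {
    return ((uintptr_t)obj - heap_base) >> UNIT_SHIFT;
}

void set_marked(void *obj) {
    uintptr_t i = unit_index(obj);
    mark_bits[i >> 5] |= 1u << (i & 31);  /* set the object's bit */
}

int is_marked(void *obj) {
    uintptr_t i = unit_index(obj);
    return (mark_bits[i >> 5] >> (i & 31)) & 1;
}
```

Sweep can then scan mark_bits word by word and skip long runs of zeros without touching the heap itself.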
8
Parallel STW MSC on an N-Way Machine
Usually one GC thread per processor

Parallel mark
- Synchronized marking; load balancing needed (overflow, starvation)
- Separate mark stacks for each thread
  - Stealing from designated "private" areas attached to the mark stacks (Endo et al)
  - Direct access to the mark stacks of other threads (Flood et al)
  - Adding a single shared stack (Cheng & Blelloch)

Parallel sweep
- Heap divided into M areas, M > f * N
- Synchronization needed for area selection and free-list concatenation (see the sketch below)

Compact
- Tradeoff: parallelism vs. auxiliary data structures
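One plausible way to synchronize area selection is an atomic claim counter; this is a sketch under that assumption, not necessarily the collector's actual scheme:

```c
#include <stdatomic.h>

#define M_AREAS 256                  /* M areas, with M > f * N for N threads */

static atomic_int next_area;         /* next unclaimed area */

extern void sweep_area(int area);    /* assumed per-area sweep routine */

/* Each GC thread runs this loop; areas are handed out one at a time. */
void parallel_sweep_worker(void) {
    for (;;) {
        int area = atomic_fetch_add(&next_area, 1);
        if (area >= M_AREAS)
            break;                   /* all areas claimed */
        sweep_area(area);
    }
}
```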
9
Outline
- Introduction
- Principles of concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Coding issues
- Results (highlights)
- Recent work
10
The Concurrent Collection Principle
- STW pauses may reach seconds, and the cost of the mark phase is dominant
- Marking may be done while the program is active, except that the object graph changes...
- Correctness is kept by use of a write barrier
  - Activated at each change of a reference field in an object (gray it!)
  - Functionality is algorithm-dependent, sometimes expensive
11
The Concurrent Collection: What, Who, How
Mostly concurrent MSC
- Tracing done while mutator threads are active (Boehm et al, Printezis & Detlefs)
- Retrace (clean) may be done while mutator threads are active
- Short final STW: last clean and the resulting tracing, then sweep
- Originally done by a separate thread/processor ("real" concurrent)

Incremental work
- Tracing done incrementally by the mutator threads
- First done on a copying collector (Baker)

Parallel execution
- The concurrent phase is also parallel: many threads can do concurrent work simultaneously
12
The IBM Mostly Concurrent GC
Putting together all existing elements
- First production-level parallel, incremental, mostly concurrent MSC collector
- Combines incremental and concurrent tracing: efficient concurrent marking that terminates on time
- New mechanism for parallel load balancing, especially fit for a dynamic number of participating threads

When compared to a mature industry-quality GC:
- Drastic reduction in pause time (more than 75%)
- Small throughput hit (~10%)
13
Phases of the Collector
Concurrent phase
- Tracing of all reachable objects, done incrementally by Java mutators and dedicated low-priority tracing threads
- The write barrier records changes per region (card) in a card table
  - Any change of a reference in an object dirties the card; all black objects in the card change to gray
  - Fast and relatively cheap operation (2% - 6% throughput hit)
- A single card cleaning pass (sketched below)
  - In each dirty card, retrace all marked objects
  - Cleaning may precede the actual tracing

Final STW phase
- Root scanning and a final card cleaning pass
- Tracing of all additional objects
- Parallel sweep (now replaced by concurrent sweep)
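A sketch of a card cleaning pass under the assumptions above (512-byte cards, one byte per card in the table; retrace_marked_objects_in_card is a hypothetical helper):

```c
#include <stdint.h>
#include <stddef.h>

extern uint8_t *card_table;        /* one byte per 512-byte card */
extern size_t   n_cards;
extern void     retrace_marked_objects_in_card(size_t card);  /* assumed helper */

void card_cleaning_pass(void) {
    for (size_t c = 0; c < n_cards; c++) {
        if (card_table[c]) {
            card_table[c] = 0;     /* clean first: a racing mutator store     */
                                   /* will re-dirty the card for a later pass */
            retrace_marked_objects_in_card(c);
        }
    }
}
```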
14
CPU Distribution
[Figure: CPU distribution over the collection phases]
15
Write Barrier
Activated by the JVM on each reference change done in Java; writes to a card table
- Each card covers 512 bytes of heap
- Cleaning (concurrent or final) may happen at any time

Example: Foo.a = O1
- Store O1 in a root (guaranteed to be reachable)
- Set Foo.a to O1
- Activate the write barrier on Foo: dirty the entry of Foo in the card table
- Remove O1 from the root

An object may span many cards; it is usually mapped to the card where its header starts. A sketch follows.
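A minimal write-barrier sketch matching the slide's parameters (512-byte cards; the names are illustrative, not the JVM's actual code):

```c
#include <stdint.h>

#define CARD_SIZE_SHIFT 9                 /* 512-byte cards */

typedef struct Object Object;

extern uint8_t  *card_table;              /* one byte per card */
extern uintptr_t heap_base;               /* start of the collected heap */

/* Called by the JVM on every reference store obj->field = value. */
void write_barrier(Object *obj, Object **field, Object *value) {
    *field = value;                                        /* the actual store */
    uintptr_t card = ((uintptr_t)obj - heap_base) >> CARD_SIZE_SHIFT;
    card_table[card] = 1;                                  /* dirty obj's card */
}
```

Dirtying the card of the object's header is what lets cleaning retrace the whole object, even when the modified field lies in a later card.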
16
Outline
- Introduction
- Principles of concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Coding issues
- Results (highlights)
- Recent work
17
The Problem of Punctual Termination
- Traditional STW collection starts when the heap is full, which produces the minimal number of GCs
- Mostly concurrent aims at completing the concurrent marking (CM) exactly when the heap becomes full
- If the heap gets filled before CM terminates, the rest of the marking moves to the final phase: a longer pause
- If CM terminates before the heap is filled, choose the lesser of two evils:
  - Wait for the heap to fill (and accumulate more dirty cards)
  - Initiate an "early" GC (more GCs, with all their additional costs)
- Conclusion: concurrent marking should be adaptive
18
Combining Concurrent and Incremental
Existing approaches

Incremental
- Mutators do tracing proportional to allocation (the tracing rate)
- Tracing is guaranteed to terminate (more or less) on time
- Decreases application performance

Specialized GC threads (concurrent)
- More efficient, better CPU utilization
- Tracing rate determined by the ratio between GC and program threads, usually not changed by the collector
- No control over the termination point

Hybrid usage of both ways
- Low-priority background threads fully utilize CPU idle time; not controlled by the tracing rate
- Mutators perform incremental tracing to ensure proper termination, but only if the tracing goals are not met by the background threads
- Control "milestones" (concurrent start, card cleaning pass start, etc.)
19
Metering Formulas
Kickoff point of the concurrent phase
- User-specified tracing rate (TR), live-objects estimate (L_est), dirty-objects estimate (M_est)
- Start concurrent marking when free space gets below (L_est + M_est) / TR, so the tracing rate, applied to the remaining allocations, matches the tracing work

Calculating the work
- Amount of concurrently traced objects (Traced), amount of remaining free memory (Free)
- Estimated background tracing rate (B_est): the ratio between the total amounts of background tracing and allocation
- Dynamically recalculate the actual tracing rate, the ratio between the remaining work and the free space:
  ATR = (L_est + M_est - Traced) / Free
- Account for work done in the background:
  ATR2 = ATR - B_est
- A mutator traces only if the background tracing lags (see the sketch below)
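The formulas translate directly into code; a sketch with illustrative names (the production collector's bookkeeping is more involved):

```c
typedef struct {
    double L_est;   /* estimated live bytes */
    double M_est;   /* estimated bytes to retrace in dirty cards */
    double TR;      /* user-specified tracing rate (traced bytes per allocated byte) */
} Metering;

/* Kick off concurrent marking once free space drops below (L_est + M_est) / TR,
   so tracing at rate TR over the remaining allocations covers the expected work. */
int should_start_concurrent(const Metering *m, double free_bytes) {
    return free_bytes < (m->L_est + m->M_est) / m->TR;
}

/* Actual tracing rate: remaining work over remaining free space, minus the
   rate already covered by background threads. Mutators trace only if positive. */
double mutator_tracing_rate(const Metering *m, double traced,
                            double free_bytes, double b_est) {
    double atr  = (m->L_est + m->M_est - traced) / free_bytes;
    double atr2 = atr - b_est;
    return atr2 > 0.0 ? atr2 : 0.0;
}
```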
20
Behavior Patterns
[Figure: CPU usage over time, showing Java mutation, incremental tracing, background tracing, and parallel STW, for five configurations: STW MSC GC (throughput 100%); concurrent, tracing rate 3 (throughput 80%); concurrent, tracing rate 8 (throughput 90%); concurrent, tracing rate 8, CPU 80% (throughput 95%); concurrent, tracing rate 8, CPU 50% (throughput 110%).]
21
Outline
- Introduction
- Principles of concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Coding issues
- Results (highlights)
- Recent work
22
Load Balancing - the Problem
Requirements:
- Even distribution of objects between parallel tracing threads
- Avoid mark stack overflow and/or starvation of threads
- Suitable for an unknown number of collection threads
- Efficient synchronization
- Supply simple termination detection

Existing approaches (all use separate mark stacks for each thread)
- Stealing from designated "private" areas attached to the mark stacks (Endo et al)
- Direct access to the mark stacks of other threads (Flood et al)
- Adding a single shared stack (Cheng & Blelloch)
23
Load Balancing for Concurrent GC: Pools of WorkPackets
- Each WorkPacket is a smaller mark stack
- Cheap get/put synchronization (compare & swap)
- Separate pools for different occupancy (Full, Non-Full, Non-Empty, Empty); each pool maintains a counter
- A tracing thread uses 2 WorkPackets:
  - Objects are popped from the input WP; newly marked objects are pushed to the output WP
  - An empty input WP is returned to the "Empty" pool, and a new as-full-as-possible WP is fetched
  - A full output WP is returned to the "Full" pool, and a new as-empty-as-possible WP is fetched
- Different object graph traversal: BFS, limited by the capacity of a WorkPacket
A pool sketch follows.
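A sketch of a lock-free packet pool using compare-and-swap on the pool's list head (assumed layout; it also ignores the ABA problem that a production implementation must handle):

```c
#include <stdatomic.h>
#include <stddef.h>

#define PACKET_CAPACITY 512

typedef struct Object Object;

typedef struct WorkPacket {
    struct WorkPacket *next;            /* link inside a pool */
    size_t count;                       /* occupied slots */
    Object *slots[PACKET_CAPACITY];     /* a small mark stack */
} WorkPacket;

typedef struct {
    _Atomic(WorkPacket *) head;         /* lock-free LIFO of packets */
    atomic_size_t size;                 /* the pool's occupancy counter */
} PacketPool;

void pool_put(PacketPool *pool, WorkPacket *wp) {
    WorkPacket *old = atomic_load(&pool->head);
    do {
        wp->next = old;
    } while (!atomic_compare_exchange_weak(&pool->head, &old, wp));
    atomic_fetch_add(&pool->size, 1);
}

WorkPacket *pool_get(PacketPool *pool) {
    WorkPacket *old = atomic_load(&pool->head);
    while (old && !atomic_compare_exchange_weak(&pool->head, &old, old->next))
        ;                               /* retry on contention */
    if (old)
        atomic_fetch_sub(&pool->size, 1);
    return old;                         /* NULL when the pool is empty */
}
```

A tracing thread then cycles: pop from its input packet, push newly marked objects to its output packet, and swap packets with the pools as they empty or fill.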
24
Advantages of WorkPackets
- Fair competition when input is scarce: all threads get the same chance at tracing input
- Simple detection of the tracing state:
  - Overflow - all packets are full (scalability is possible: simply allocate more WPs)
  - Starvation - only empty WPs are available, but not all WPs are in the "Empty" pool
  - Termination - all WPs are in the "Empty" pool
- Positive results measured: low cost of synchronization, fair distribution of work among threads
25
Outline
- Introduction
- Principles of concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Coding issues
- Results (highlights)
- Recent work
26
Concurrent Code Maintenance
Extremely difficult to verify:
- Races between concurrent tracing and the program, and races between concurrent tracers
- Timing is a major factor: the debug version cannot reproduce release bugs
- Problems surface only occasionally; behavior is machine-dependent

About 40% of the code is verification code:
- Sanity checks: asserts, consistency checks
- Logging of collection activity, state, and history
- A shadow heap for tracing history, and a shadow card table for card state and treatment
- Code to use the above for printing detailed information
27
Outline
- Introduction
- Principles of concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Results (highlights)
- Recent work
28
Comparison with STW GC
- Compared to STW MSC, using IBM's production-level JVM
- 4-way machines: NT, AIX, and IA64
- Mostly testing SPECjbb: server-side Java, throughput-driven, 60% live objects
- Pause time cut by 75%, mark time by 86%; sweep becomes dominant
- Throughput hit of 10%
29
Comparison with STW GC (cont.)
Also testing pBOB, an IBM internal benchmark
- Fit for a 2.5 GB heap, with low CPU utilization
- Many threads
30
Effects of Various Tracing Rates
- Mutator utilization: the amount of Java mutation done during the concurrent phase
- Also controls the size of the "per thread mini-STW"
31
Effects of Various Tracing Rates (cont.)
- Floating garbage: marked objects that become unreachable before the final STW phase
- Amount of cards cleaned
32
References
- Ossia, Ben-Yitzhak, Goft, Kolodner, Leikehman, Owshanko. A parallel, incremental and concurrent GC for servers. PLDI '02.
- Boehm, Demers, Shenker. Mostly parallel garbage collection. ACM SIGPLAN Notices, 1991.
- Dijkstra, Lamport, Martin, Scholten, Steffens. On-the-fly garbage collection: an exercise in cooperation. Comm. ACM, 1978.
- Printezis, Detlefs. A generational mostly-concurrent garbage collector. ISMM 2000.
- And many more...
33
Outline
- Introduction
- Dividing the concurrent work
- Parallel load balancing mechanism
- Results (highlights)
- Recent work
34
Concurrent Sweep
- Sweep became the dominant part of the remaining pause time
- Except for the initial allocation needed, the rest of the sweep can be deferred
- Concurrent sweep is done incrementally, after the final phase and before the next concurrent collection
- Work is done on each allocation request (see the sketch below), with no additional performance cost
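A sketch of how sweep work can ride on allocation requests (names assumed; a real allocator's fast path is more elaborate):

```c
#include <stddef.h>

extern void *free_list_take(size_t size);   /* fast path: pop a fitting chunk */
extern int   sweep_increment(size_t size);  /* sweep a bit more of the heap;
                                               returns 0 when fully swept */

void *allocate(size_t size) {
    for (;;) {
        void *chunk = free_list_take(size);
        if (chunk)
            return chunk;                   /* enough of the heap already swept */
        if (!sweep_increment(size))
            return NULL;                    /* fully swept: caller triggers a GC */
    }
}
```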
35
Improving the Low Tracing Rate
A low tracing rate is more application-friendly
- More CPU is left to the program, and shorter tracing periods are forced on threads
- But throughput is reduced

Goal: improve throughput with a minimal hit on pause times
- Achieved by reducing dirty cards and floating garbage
- Better performance, reduced heap residency
36
End