1
Parallel, Incremental, and Mostly Concurrent GC
Yoav Ossia
IBM Haifa Research Laboratory, November 2002
2
Motivation
Modern SMP servers introduce:
- Higher levels of true parallelism
- Multi-gigabyte heaps
- Multi-threaded applications which must ensure fast response times

New demands on GC:
- Short pause times on large heaps
- Minimal throughput hit
- Scalability on multiprocessor hardware
- Efficient algorithms for weakly ordered hardware (we will not talk about this...)

Workarounds, which do not work:
- Bigger heaps
- Object pooling
3
Outline
- Introduction
- Principles of concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Coding issues
- Results (highlights)
- Recent work
4
Goals of this lecture
- Present the ideas and algorithms, as done in the IBM JVM
- Introduce the concerns of implementation
  - The devil's in the details
  - What's cheap and what's expensive, and how this gets into the design
- Concurrent coding, the real story
  - Difficulties, and how to avoid them
5
Mark Sweep Compact GC (MSC)
Mark - traces all reachable (live) objects in the heap
- Needed data structures and operations:
  - Mark stack (push & pop of references to objects)
  - SetMarked (Object *obj) & boolean IsMarked (Object *obj)
- Start from all roots (threads' stacks, system classes, etc.): mark, and push to the mark stack
- Process the mark stack until empty: pop Obj, and trace (mark and push all unmarked references from Obj); see the sketch below

Sweep
- Coalesce unmarked objects into free chunks
- Create a list of free chunks

Compact (the disaster zone)
- Usually done when an allocation request cannot be satisfied
- Move and coalesce the live objects, to create bigger free chunks
- Usually very long, unattractive
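As a concrete illustration, here is a minimal sketch of the mark phase in C. The helper names (push, pop, references_of, and so on) are assumptions for the example, not the IBM JVM's actual interfaces:

```c
#include <stddef.h>

typedef struct Object Object;

/* Assumed mark-stack and heap helpers. */
extern void     push(Object *obj);
extern Object  *pop(void);
extern int      stack_empty(void);
extern void     set_marked(Object *obj);
extern int      is_marked(Object *obj);
extern Object **references_of(Object *obj, size_t *n);  /* reference fields of obj */

/* Mark all objects reachable from the roots. */
void mark_phase(Object **roots, size_t nroots) {
    for (size_t i = 0; i < nroots; i++) {
        if (roots[i] && !is_marked(roots[i])) {
            set_marked(roots[i]);
            push(roots[i]);
        }
    }
    while (!stack_empty()) {            /* process until the stack drains */
        size_t n;
        Object *obj   = pop();
        Object **refs = references_of(obj, &n);
        for (size_t j = 0; j < n; j++) {
            if (refs[j] && !is_marked(refs[j])) {
                set_marked(refs[j]);    /* mark before pushing, so each */
                push(refs[j]);          /* object enters the stack once */
            }
        }
    }
}
```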
6
The MSC Process
[Figure: roots (stacks, globals) feed the Mark phase, followed by Sweep and, if needed, Compact. Legend: Used, Marked, Free, Dark matter, Unmovable.]
7
MSC Subtle Issues
Mark
- Proportional to the amount of live objects only
- Risk of mark stack overflow (e.g., a linked list with Next as the last field)
- Avoid paging and cache misses

Sweep
- Walks the entire heap again
- The naive method is proportional to the amount of live AND dead objects
- Partially solved by a mark bit-vector (sketched below):
  - A bit for each basic heap unit (typically 8 bytes); each object is mapped to a specific bit
  - Walk the mark bit-vector, and inspect the heap only when big holes are found in the vector

Compact
- Tradeoff: level of compaction vs. speed
- If the GC is not type-accurate, not all objects can be moved: we can't tell whether a slot on the stack is a reference or a numeric value
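A sketch of the mark bit-vector mapping described above, with one bit per 8-byte heap unit (heap_base and the 32-bit word layout are assumptions for the example):

```c
#include <stdint.h>

#define UNIT_SHIFT 3                      /* one bit per 8-byte heap unit */

static uintptr_t heap_base;               /* start address of the heap */
static uint32_t *mark_bits;               /* the mark bit-vector */

static inline uintptr_t unit_index(void *obj) {
    return ((uintptr_t)obj - heap_base) >> UNIT_SHIFT;
}

void set_marked(void *obj) {
    uintptr_t i = unit_index(obj);
    mark_bits[i >> 5] |= 1u << (i & 31);  /* set the object's bit */
}

int is_marked(void *obj) {
    uintptr_t i = unit_index(obj);
    return (mark_bits[i >> 5] >> (i & 31)) & 1;
}
```

Sweep can then scan mark_bits word by word and skip long runs of zeros without touching the heap itself.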
8
Parallel STW MSC on an N-Way Machine
Usually one GC thread per processor

Parallel mark
- Synchronized marking; load balancing needed (overflow, starvation)
- Separate mark stacks for each thread
  - Stealing from designated "private" areas attached to the mark stacks (Endo et al)
  - Direct access to the mark stacks of other threads (Flood et al)
  - Adding a single shared stack (Cheng & Blelloch)

Parallel sweep
- Heap divided into M areas, M > f * N
- Synchronization needed for area selection and free-list concatenation (see the sketch below)

Compact
- Tradeoff: parallelism vs. auxiliary data structures
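One plausible way to synchronize area selection is an atomic claim counter; this is a sketch under that assumption, not necessarily the collector's actual scheme:

```c
#include <stdatomic.h>

#define M_AREAS 256                  /* M areas, with M > f * N for N threads */

static atomic_int next_area;         /* next unclaimed area */

extern void sweep_area(int area);    /* assumed per-area sweep routine */

/* Each GC thread runs this loop; areas are handed out one at a time. */
void parallel_sweep_worker(void) {
    for (;;) {
        int area = atomic_fetch_add(&next_area, 1);
        if (area >= M_AREAS)
            break;                   /* all areas claimed */
        sweep_area(area);
    }
}
```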
9
Outline
- Introduction
- Principles of concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Coding issues
- Results (highlights)
- Recent work
10
The Concurrent Collection Principle
- STW pauses may reach seconds, and the cost of the mark phase is dominant
- Marking may be done while the program is active, except that the object graph changes...
- Correctness is kept by use of a write barrier
  - Activated at each change of a reference field in an object (gray it!)
  - Functionality is algorithm-dependent, sometimes expensive
11
The Concurrent Collection: What, Who, How
Mostly concurrent MSC
- Tracing done while mutator threads are active (Boehm et al, Printezis & Detlefs)
- Retrace (clean) may be done while mutator threads are active
- Short final STW: last clean and the resulting tracing, then sweep
- Originally done by a separate thread/processor ("real" concurrent)

Incremental work
- Tracing done incrementally by the mutator threads
- First done on a copying collector (Baker)

Parallel execution
- The concurrent phase is also parallel: many threads can do concurrent work simultaneously
12
The IBM Mostly Concurrent GC
Putting together all existing elements
- First production-level parallel, incremental, mostly concurrent MSC collector
- Combines incremental and concurrent tracing: efficient concurrent marking that terminates on time
- New mechanism for parallel load balancing, especially fit for a dynamic number of participating threads

When compared to a mature industry-quality GC:
- Drastic reduction in pause time (more than 75%)
- Small throughput hit (~10%)
13
Phases of the Collector
Concurrent phase
- Tracing of all reachable objects, done incrementally by Java mutators and dedicated low-priority tracing threads
- The write barrier records changes per region (card) in a card table
  - Any change of a reference in an object dirties the card; all black objects in the card change to gray
  - Fast and relatively cheap operation (2% - 6% throughput hit)
- A single card cleaning pass (sketched below)
  - In each dirty card, retrace all marked objects
  - Cleaning may precede the actual tracing

Final STW phase
- Root scanning and a final card cleaning pass
- Tracing of all additional objects
- Parallel sweep (now replaced by concurrent sweep)
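A sketch of a card cleaning pass under the assumptions above (512-byte cards, one byte per card in the table; retrace_marked_objects_in_card is a hypothetical helper):

```c
#include <stdint.h>
#include <stddef.h>

extern uint8_t *card_table;        /* one byte per 512-byte card */
extern size_t   n_cards;
extern void     retrace_marked_objects_in_card(size_t card);  /* assumed helper */

void card_cleaning_pass(void) {
    for (size_t c = 0; c < n_cards; c++) {
        if (card_table[c]) {
            card_table[c] = 0;     /* clean first: a racing mutator store     */
                                   /* will re-dirty the card for a later pass */
            retrace_marked_objects_in_card(c);
        }
    }
}
```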
14
CPU Distribution
[Figure: CPU distribution over the collection phases]
15
Write Barrier
Activated by the JVM on each reference change done in Java; writes to a card table
- Each card covers 512 bytes of heap
- Cleaning (concurrent or final) may happen at any time

Example: Foo.a = O1
- Store O1 in a root (guaranteed to be reachable)
- Set Foo.a to O1
- Activate the write barrier on Foo: dirty the entry of Foo in the card table
- Remove O1 from the root

An object may span many cards; it is usually mapped to the card where its header starts. A sketch follows.
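A minimal write-barrier sketch matching the slide's parameters (512-byte cards; the names are illustrative, not the JVM's actual code):

```c
#include <stdint.h>

#define CARD_SIZE_SHIFT 9                 /* 512-byte cards */

typedef struct Object Object;

extern uint8_t  *card_table;              /* one byte per card */
extern uintptr_t heap_base;               /* start of the collected heap */

/* Called by the JVM on every reference store obj->field = value. */
void write_barrier(Object *obj, Object **field, Object *value) {
    *field = value;                                        /* the actual store */
    uintptr_t card = ((uintptr_t)obj - heap_base) >> CARD_SIZE_SHIFT;
    card_table[card] = 1;                                  /* dirty obj's card */
}
```

Dirtying the card of the object's header is what lets cleaning retrace the whole object, even when the modified field lies in a later card.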
16
Outline
- Introduction
- Principles of concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Coding issues
- Results (highlights)
- Recent work
17
The Problem of Punctual Termination
- Traditional STW collection starts when the heap is full, which produces the minimal number of GCs
- Mostly concurrent aims at completing the concurrent marking (CM) exactly when the heap becomes full
- If the heap gets filled before CM terminates, the rest of the marking moves to the final phase: a longer pause
- If CM terminates before the heap is filled, choose the lesser of two evils:
  - Wait for the heap to fill (and accumulate more dirty cards)
  - Initiate an "early" GC (more GCs, with all their additional costs)
- Conclusion: concurrent marking should be adaptive
18
Combining Concurrent and Incremental
Existing approaches

Incremental
- Mutators do tracing proportional to allocation (the tracing rate)
- Tracing is guaranteed to terminate (more or less) on time
- Decreases application performance

Specialized GC threads (concurrent)
- More efficient, better CPU utilization
- Tracing rate determined by the ratio between GC and program threads, usually not changed by the collector
- No control over the termination point

Hybrid usage of both ways
- Low-priority background threads fully utilize CPU idle time; not controlled by the tracing rate
- Mutators perform incremental tracing to ensure proper termination, but only if the tracing goals are not met by the background threads
- Control "milestones" (concurrent start, card cleaning pass start, etc.)
19
Metering Formulas
Kickoff point of the concurrent phase
- User-specified tracing rate (TR), live-objects estimate (L_est), dirty-objects estimate (M_est)
- Start concurrent marking when free space gets below (L_est + M_est) / TR, so the tracing rate, applied to the remaining allocations, matches the tracing work

Calculating the work
- Amount of concurrently traced objects (Traced), amount of remaining free memory (Free)
- Estimated background tracing rate (B_est): the ratio between the total amounts of background tracing and allocation
- Dynamically recalculate the actual tracing rate, the ratio between the remaining work and the free space:
  ATR = (L_est + M_est - Traced) / Free
- Account for work done in the background:
  ATR2 = ATR - B_est
- A mutator traces only if the background tracing lags (see the sketch below)
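The formulas translate directly into code; a sketch with illustrative names (the production collector's bookkeeping is more involved):

```c
typedef struct {
    double L_est;   /* estimated live bytes */
    double M_est;   /* estimated bytes to retrace in dirty cards */
    double TR;      /* user-specified tracing rate (traced bytes per allocated byte) */
} Metering;

/* Kick off concurrent marking once free space drops below (L_est + M_est) / TR,
   so tracing at rate TR over the remaining allocations covers the expected work. */
int should_start_concurrent(const Metering *m, double free_bytes) {
    return free_bytes < (m->L_est + m->M_est) / m->TR;
}

/* Actual tracing rate: remaining work over remaining free space, minus the
   rate already covered by background threads. Mutators trace only if positive. */
double mutator_tracing_rate(const Metering *m, double traced,
                            double free_bytes, double b_est) {
    double atr  = (m->L_est + m->M_est - traced) / free_bytes;
    double atr2 = atr - b_est;
    return atr2 > 0.0 ? atr2 : 0.0;
}
```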
20
Behavior Patterns
[Figure: CPU usage over time, showing Java mutation, incremental tracing, background tracing, and parallel STW, for five configurations: STW MSC GC (throughput 100%); concurrent, tracing rate 3 (throughput 80%); concurrent, tracing rate 8 (throughput 90%); concurrent, tracing rate 8, CPU 80% (throughput 95%); concurrent, tracing rate 8, CPU 50% (throughput 110%).]
21
Outline
- Introduction
- Principles of concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Coding issues
- Results (highlights)
- Recent work
22
Load Balancing - the Problem
Requirements:
- Even distribution of objects between parallel tracing threads
- Avoid mark stack overflow and/or starvation of threads
- Suitable for an unknown number of collection threads
- Efficient synchronization
- Supply simple termination detection

Existing approaches (all use separate mark stacks for each thread)
- Stealing from designated "private" areas attached to the mark stacks (Endo et al)
- Direct access to the mark stacks of other threads (Flood et al)
- Adding a single shared stack (Cheng & Blelloch)
23
Load Balancing for Concurrent GC: Pools of WorkPackets
- Each WorkPacket is a smaller mark stack
- Cheap get/put synchronization (compare & swap)
- Separate pools for different occupancy (Full, Non-Full, Non-Empty, Empty); each pool maintains a counter
- A tracing thread uses 2 WorkPackets:
  - Objects are popped from the input WP; newly marked objects are pushed to the output WP
  - An empty input WP is returned to the "Empty" pool, and a new as-full-as-possible WP is fetched
  - A full output WP is returned to the "Full" pool, and a new as-empty-as-possible WP is fetched
- Different object graph traversal: BFS, limited by the capacity of a WorkPacket
A pool sketch follows.
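A sketch of a lock-free packet pool using compare-and-swap on the pool's list head (assumed layout; it also ignores the ABA problem that a production implementation must handle):

```c
#include <stdatomic.h>
#include <stddef.h>

#define PACKET_CAPACITY 512

typedef struct Object Object;

typedef struct WorkPacket {
    struct WorkPacket *next;            /* link inside a pool */
    size_t count;                       /* occupied slots */
    Object *slots[PACKET_CAPACITY];     /* a small mark stack */
} WorkPacket;

typedef struct {
    _Atomic(WorkPacket *) head;         /* lock-free LIFO of packets */
    atomic_size_t size;                 /* the pool's occupancy counter */
} PacketPool;

void pool_put(PacketPool *pool, WorkPacket *wp) {
    WorkPacket *old = atomic_load(&pool->head);
    do {
        wp->next = old;
    } while (!atomic_compare_exchange_weak(&pool->head, &old, wp));
    atomic_fetch_add(&pool->size, 1);
}

WorkPacket *pool_get(PacketPool *pool) {
    WorkPacket *old = atomic_load(&pool->head);
    while (old && !atomic_compare_exchange_weak(&pool->head, &old, old->next))
        ;                               /* retry on contention */
    if (old)
        atomic_fetch_sub(&pool->size, 1);
    return old;                         /* NULL when the pool is empty */
}
```

A tracing thread then cycles: pop from its input packet, push newly marked objects to its output packet, and swap packets with the pools as they empty or fill.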
24
Advantages of WorkPackets
- Fair competition when input is scarce: all threads get the same chance at tracing input
- Simple detection of the tracing state:
  - Overflow - all packets are full (scalability is possible: simply allocate more WPs)
  - Starvation - only empty WPs are available, but not all WPs are in the "Empty" pool
  - Termination - all WPs are in the "Empty" pool
- Positive results measured: low cost of synchronization, fair distribution of work among threads
25
Outline
- Introduction
- Principles of concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Coding issues
- Results (highlights)
- Recent work
26
Concurrent Code Maintenance
Extremely difficult to verify:
- Races between concurrent tracing and the program, and races between concurrent tracers
- Timing is a major factor: the debug version cannot reproduce release bugs
- Problems surface only occasionally; behavior is machine-dependent

About 40% of the code is verification code:
- Sanity checks: asserts, consistency checks
- Logging of collection activity, state, and history
- A shadow heap for tracing history, and a shadow card table for card state and treatment
- Code to use the above for printing detailed information
27
Outline
- Introduction
- Principles of concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Results (highlights)
- Recent work
28
Comparison with STW GC
- Compared to STW MSC, using IBM's production-level JVM
- 4-way machines: NT, AIX, and IA64
- Mostly testing SPECjbb: server-side Java, throughput-driven, 60% live objects
- Pause time cut by 75%, mark time by 86%; sweep becomes dominant
- Throughput hit of 10%
29
Comparison with STW GC (cont.)
Also testing pBOB, an IBM internal benchmark
- Fit for a 2.5 GB heap, with low CPU utilization
- Many threads
30
Effects of Various Tracing Rates
- Mutator utilization: the amount of Java mutation done during the concurrent phase
- Also controls the size of the "per thread mini-STW"
31
Effects of Various Tracing Rates (cont.)
- Floating garbage: marked objects that become unreachable before the final STW phase
- Amount of cards cleaned
32
References
- Ossia, Ben-Yitzhak, Goft, Kolodner, Leikehman, Owshanko. A parallel, incremental and concurrent GC for servers. PLDI '02.
- Boehm, Demers, Shenker. Mostly parallel garbage collection. ACM SIGPLAN Notices, 1991.
- Dijkstra, Lamport, Martin, Scholten, Steffens. On-the-fly garbage collection: an exercise in cooperation. Comm. ACM, 1978.
- Printezis, Detlefs. A generational mostly-concurrent garbage collector. ISMM 2000.
- And many more...
33
Outline
- Introduction
- Dividing the concurrent work
- Parallel load balancing mechanism
- Results (highlights)
- Recent work
34
Concurrent Sweep
- Sweep became the dominant part of the remaining pause time
- Except for the initial allocation needed, the rest of the sweep can be deferred
- Concurrent sweep is done incrementally, after the final phase and before the next concurrent collection
- Work is done on each allocation request (see the sketch below), with no additional performance cost
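A sketch of how sweep work can ride on allocation requests (names assumed; a real allocator's fast path is more elaborate):

```c
#include <stddef.h>

extern void *free_list_take(size_t size);   /* fast path: pop a fitting chunk */
extern int   sweep_increment(size_t size);  /* sweep a bit more of the heap;
                                               returns 0 when fully swept */

void *allocate(size_t size) {
    for (;;) {
        void *chunk = free_list_take(size);
        if (chunk)
            return chunk;                   /* enough of the heap already swept */
        if (!sweep_increment(size))
            return NULL;                    /* fully swept: caller triggers a GC */
    }
}
```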
35
Improving the Low Tracing Rate
A low tracing rate is more application-friendly
- More CPU is left to the program, and shorter tracing periods are forced on threads
- But throughput is reduced

Goal: improve throughput with a minimal hit on pause times
- Achieved by reducing dirty cards and floating garbage
- Better performance, reduced heap residency
36
End