380C: Where are we & where we are going
– Managed languages
    Dynamic compilation
    Inlining
    Garbage collection
    What else can you do when you examine the heap a lot?
– Why you need to care about workloads
– Alias analysis
– Dependence analysis
– Loop transformations
– EDGE architectures
380C Lecture 18: Garbage Collection
– Why use garbage collection?
– What is garbage?
    Reachable vs. live, stack maps, etc.
– Allocators and their collection mechanisms
    Semi-space
    Mark-sweep
    Performance comparisons
    Mark-region
– Incremental age-based collection
    Write barriers: friend or foe?
    Generational
    Beltway
Mark Region and Other Advances in Garbage Collection
Kathryn S. McKinley, University of Texas at Austin
Stephen M. Blackburn, Australian National University
PLDI’08: Immix: A Mark-Region Collector With Space Efficiency, Fast Collection, and Mutator Performance
Isn’t GC a bit retro?
“Languages without automated garbage collection are getting out of fashion. The chance of running into all kinds of memory problems is gradually outweighing the performance penalty you have to pay for garbage collection.”
Paul Jansen, managing director of TIOBE Software, in Dr Dobbs, April 2008
– Mark-Compact: Styger, 1967
– Mark-Sweep: McCarthy, 1960
– Semi-Space: Cheney, 1970
GC Fundamentals: The Time–Space Tradeoff
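A back-of-the-envelope model shows why more heap space buys less collection time. This is a minimal sketch under idealized assumptions that are not taken from the slides. It assumes a full-heap tracing collector whose cost is proportional to live data, and a collection is triggered each time the free space fills. It ignores mutator locality entirely. The function and parameter names are hypothetical.

```python
def estimated_gc_work(live: float, heap: float, allocated: float,
                      cost_per_live_unit: float = 1.0) -> float:
    """Idealized model (an assumption, not measured data): every collection
    traces all live data, and a collection is triggered each time the free
    space (heap - live) has been consumed by new allocation."""
    if heap <= live:
        raise ValueError("heap must be larger than the live data")
    collections = allocated / (heap - live)
    return collections * live * cost_per_live_unit


# Same program (50 units live, 1000 units allocated) run in two heap sizes.
small = estimated_gc_work(live=50, heap=100, allocated=1000)
large = estimated_gc_work(live=50, heap=200, allocated=1000)
```

Under this model, doubling the heap from 100 to 200 units cuts the modeled GC work from 1000 to about 333. The cost is that the heap is twice as large.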
Our Goal
GC Fundamentals: Algorithmic Components
– Allocation: Bump Allocation, Free List
– Identification: Tracing (implicit), Reference Counting (explicit)
– Reclamation: Sweep-to-Free, Compact, Evacuate
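The two allocation styles can be contrasted in a few lines. This is a toy sketch in which addresses are plain integers, and the class names are hypothetical. The free list here uses simple first-fit search, which is a stand-in chosen for brevity. Production free-list allocators typically use segregated size classes instead.

```python
from typing import List, Optional, Tuple


class BumpAllocator:
    """Allocates by advancing a cursor through one contiguous free region."""

    def __init__(self, start: int, limit: int):
        self.cursor = start
        self.limit = limit

    def alloc(self, size: int) -> Optional[int]:
        if self.cursor + size > self.limit:
            return None  # region exhausted: a real VM would collect or fetch a new region
        addr = self.cursor
        self.cursor += size  # the fast path is one compare and one add
        return addr


class FreeListAllocator:
    """First-fit allocation from a list of (address, size) free cells."""

    def __init__(self, free_cells: List[Tuple[int, int]]):
        self.free = list(free_cells)

    def alloc(self, size: int) -> Optional[int]:
        for i, (addr, cell_size) in enumerate(self.free):
            if cell_size >= size:
                if cell_size == size:
                    del self.free[i]  # exact fit: consume the whole cell
                else:
                    self.free[i] = (addr + size, cell_size - size)  # split the cell
                return addr
        return None  # no cell is large enough
```

Consecutive bump allocations are adjacent in memory. Consecutive free-list allocations land wherever a suitable cell happens to be.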
GC Fundamentals: Canonical Garbage Collectors
– Mark-Sweep [McCarthy 1960]: free-list + trace + sweep-to-free
– Mark-Compact [Styger 1967]: bump allocation + trace + compact
– Semi-Space [Cheney 1970]: bump allocation + trace + evacuate
Mark-Sweep: Free-List Allocation + Trace + Sweep-to-Free
– ✓ Space efficient
– ✓ Simple, very fast collection
– ✗ Poor locality
Actual data, taken from geomean of DaCapo, jvm98, and jbb2000 on 2.4GHz Core 2 Duo
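Mark-sweep's two phases can be sketched over a toy heap of unit-sized cells. `Obj`, `mark`, `sweep`, and `mark_sweep` are hypothetical names, and marking uses an explicit stack rather than recursion. The freed cells end up scattered across the heap. Later free-list allocations therefore go wherever the holes are, which is the source of mark-sweep's poor locality.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Obj:
    refs: List[int] = field(default_factory=list)  # indices of referenced cells
    marked: bool = False


def mark(heap: List[Optional[Obj]], roots: List[int]) -> None:
    """Trace from the roots, setting the mark bit on every reachable object."""
    stack = list(roots)
    while stack:
        i = stack.pop()
        obj = heap[i]
        if obj is None or obj.marked:
            continue
        obj.marked = True
        stack.extend(obj.refs)


def sweep(heap: List[Optional[Obj]]) -> List[int]:
    """Free every unmarked cell and return the resulting free list."""
    free_list = []
    for i, obj in enumerate(heap):
        if obj is not None and obj.marked:
            obj.marked = False  # clear the mark for the next collection
        else:
            heap[i] = None
            free_list.append(i)
    return free_list


def mark_sweep(heap: List[Optional[Obj]], roots: List[int]) -> List[int]:
    mark(heap, roots)
    return sweep(heap)


# Cells 0 and 2 reference each other and 0 is a root; 3 -> 1 is unreachable.
heap = [Obj([2]), Obj(), Obj([0]), Obj([1])]
free_list = mark_sweep(heap, roots=[0])
```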
Mark-Compact: Bump Allocation + Trace + Compact
– ✓ Space efficient
– ✓ Good locality
– ✗ Expensive multi-pass collection
Actual data, taken from geomean of DaCapo, jvm98, and jbb2000 on 2.4GHz Core 2 Duo
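A sliding, LISP2-style compactor is one classic way to compact. The slide does not say which compaction algorithm its Mark-Compact uses, so this is a swapped-in illustration. After marking, the compactor makes three more passes: it computes forwarding addresses, updates references, and then moves the objects. Those extra passes are what make the collection expensive. For simplicity, objects are unit-sized and the forwarding addresses are kept in a side table (`forward`). All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Obj:
    name: str
    refs: List[int] = field(default_factory=list)  # slot indices of referents


def mark(heap: List[Optional[Obj]], roots: List[int]) -> Set[int]:
    live: Set[int] = set()
    stack = list(roots)
    while stack:
        i = stack.pop()
        if i in live or heap[i] is None:
            continue
        live.add(i)
        stack.extend(heap[i].refs)
    return live


def compact(heap: List[Optional[Obj]], roots: List[int]) -> List[int]:
    """Sliding compaction of unit-sized objects; returns the updated roots."""
    live = mark(heap, roots)
    # Pass 1: assign each live object its new slot, preserving address order.
    forward: Dict[int, int] = {}
    free = 0
    for i in range(len(heap)):
        if i in live:
            forward[i] = free
            free += 1
    # Pass 2: rewrite every reference held by a live object.
    for i in live:
        heap[i].refs = [forward[r] for r in heap[i].refs]
    # Pass 3: slide objects down; forward[i] <= i, so nothing unmoved is overwritten.
    for i in sorted(live):
        heap[forward[i]] = heap[i]
    for j in range(free, len(heap)):
        heap[j] = None
    return [forward[r] for r in roots]


heap = [None, Obj("a", [3]), Obj("dead"), Obj("b", [1])]
roots = compact(heap, roots=[3])
```

Compaction leaves the survivors contiguous at the start of the heap. This is what lets the allocator go back to bump allocation and keeps locality good.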
Semi-Space: Bump Allocation + Trace + Evacuation
– ✓ Good locality
– ✗ Space inefficient
Actual data, taken from geomean of DaCapo, jvm98, and jbb2000 on 2.4GHz Core 2 Duo
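Cheney's algorithm evacuates reachable objects breadth-first and uses the to-space itself as the work queue. In this minimal sketch, each space is a Python list and an object's index serves as its address. `forward` stands in for the forwarding pointer that a real collector writes into the old copy. The space inefficiency is structural: to-space must be able to hold every survivor, so with equal halves, half the heap sits idle between collections.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Obj:
    name: str
    refs: List[int] = field(default_factory=list)  # indices into the object's current space
    forward: Optional[int] = None  # to-space index once copied (the forwarding pointer)


def cheney_collect(from_space: List[Obj], roots: List[int]) -> Tuple[List[Obj], List[int]]:
    to_space: List[Obj] = []

    def evacuate(i: int) -> int:
        """Copy from_space[i] to to-space once; return its new index."""
        obj = from_space[i]
        if obj.forward is None:
            to_space.append(Obj(obj.name, list(obj.refs)))  # bump-allocate the copy
            obj.forward = len(to_space) - 1
        return obj.forward

    new_roots = [evacuate(r) for r in roots]
    scan = 0
    # Objects from `scan` to the end of to_space are copied but not yet
    # scanned: the to-space itself is the breadth-first work queue.
    while scan < len(to_space):
        copied = to_space[scan]
        copied.refs = [evacuate(r) for r in copied.refs]
        scan += 1
    return to_space, new_roots


from_space = [Obj("a", [2]), Obj("garbage", [0]), Obj("b", [0, 2])]
to_space, roots = cheney_collect(from_space, roots=[0])
```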
Mark-Region with Sweep-to-Region
– Reclamation: Sweep-to-Free, Compact, Evacuate, Sweep-to-Region
– Mark-Sweep: free-list + trace + sweep-to-free
– Mark-Compact: bump allocation + trace + compact
– Semi-Space: bump allocation + trace + evacuate
– Mark-Region: bump allocation + trace + sweep-to-region
Mark-Region: Bump Allocation + Trace + Sweep-to-Region
– ✓ Simple, very fast collection
– ✓ Space efficient
– ✓ Good locality
– ✓ Excellent performance
Actual data, taken from geomean of DaCapo, jvm98, and jbb2000 on 2.4GHz Core 2 Duo
Naïve Mark-Region
– Contiguous allocation into regions
    Excellent locality
    For simplicity, objects cannot span regions
– Simple mark phase (like mark-sweep)
    Mark objects and their containing region
– Unmarked regions can be freed
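The naïve scheme can be sketched under some assumptions: a hypothetical 256-byte region size, and the illustrative names `NaiveMarkRegionHeap`, `alloc`, and `collect`. Two weaknesses show up even in this small model. A region that holds even one live object is kept whole. And the unused tail of a region is wasted whenever the next object does not fit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

REGION_SIZE = 256  # hypothetical region size in bytes


@dataclass(eq=False)
class Obj:
    size: int
    refs: List["Obj"] = field(default_factory=list)
    region: int = -1
    marked: bool = False


class NaiveMarkRegionHeap:
    def __init__(self, num_regions: int):
        self.free_regions = list(range(num_regions))
        self.objects: Dict[int, List[Obj]] = {}  # region -> objects allocated in it
        self.current: Optional[int] = None
        self.cursor = 0

    def alloc(self, size: int, refs: Optional[List[Obj]] = None) -> Obj:
        if size > REGION_SIZE:
            raise ValueError("naive mark-region: objects cannot span regions")
        if self.current is None or self.cursor + size > REGION_SIZE:
            # Start a fresh region; whatever is left in the old one is wasted.
            # pop() raises IndexError when no region is free (time to collect).
            self.current = self.free_regions.pop(0)
            self.objects[self.current] = []
            self.cursor = 0
        obj = Obj(size, list(refs or []), region=self.current)
        self.objects[self.current].append(obj)
        self.cursor += size  # contiguous bump allocation within the region
        return obj

    def collect(self, roots: List[Obj]) -> None:
        live_regions = set()
        stack = list(roots)
        while stack:  # mark objects and the region containing each one
            obj = stack.pop()
            if obj.marked:
                continue
            obj.marked = True
            live_regions.add(obj.region)
            stack.extend(obj.refs)
        for region in list(self.objects):  # free every unmarked region
            if region not in live_regions:
                del self.objects[region]
                self.free_regions.append(region)
        for objs in self.objects.values():  # clear marks for the next cycle
            for obj in objs:
                obj.marked = False
        if self.current not in self.objects:
            self.current = None


heap = NaiveMarkRegionHeap(num_regions=4)
a = heap.alloc(200)
b = heap.alloc(100)  # does not fit after a, so it starts region 1
c = heap.alloc(100)  # shares region 1 with b
heap.collect(roots=[b])
```

After the collection, region 0 is freed. Object `c` is dead but is retained anyway, because it shares a marked region with `b`.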
Immix: Efficient Mark-Region Garbage Collection
Lines and Blocks
– Small regions: ✗ Fragmentation (can’t fill blocks)
– Large regions: ✓ More contiguous allocation; ✗ Fragmentation (false marking)
– Lines & blocks
    Block = N pages (TLB locality); line ≈ 1 cache line (cache locality)
    Block > 4 × max object size
    Objects span lines; lines are marked with objects
    ✓ Less fragmentation
    ✓ Fast common case
    ✗ Increased metadata o/h
    ✗ Constrained object sizes
    (Figure: a block containing free and recyclable lines)
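The line-and-block bookkeeping can be sketched as a per-block array of line marks. The 128-byte line and 32 KB block sizes are assumptions, chosen to match "≈ 1 cache line" and "N pages". `mark_lines`, `find_holes`, and `classify_block` are hypothetical helpers. In this sketch every object marks its lines exactly.

```python
from typing import List, Optional, Tuple

LINE_SIZE = 128                          # assumed: roughly one cache line, in bytes
BLOCK_SIZE = 32 * 1024                   # assumed: a block of several pages, in bytes
LINES_PER_BLOCK = BLOCK_SIZE // LINE_SIZE


def mark_lines(line_marks: List[bool], offset: int, size: int) -> None:
    """Mark every line touched by a live object at [offset, offset + size) in the block."""
    first = offset // LINE_SIZE
    last = (offset + size - 1) // LINE_SIZE
    for line in range(first, last + 1):  # objects may span lines
        line_marks[line] = True


def find_holes(line_marks: List[bool]) -> List[Tuple[int, int]]:
    """Return (first_line, n_lines) for each run of unmarked lines."""
    holes: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, marked in enumerate(line_marks):
        if not marked and start is None:
            start = i
        elif marked and start is not None:
            holes.append((start, i - start))
            start = None
    if start is not None:
        holes.append((start, len(line_marks) - start))
    return holes


def classify_block(line_marks: List[bool]) -> str:
    if not any(line_marks):
        return "free"        # wholly reusable block
    if all(line_marks):
        return "full"        # nothing to recycle
    return "recyclable"      # partially marked: its holes can be reused


marks = [False] * LINES_PER_BLOCK
mark_lines(marks, offset=100, size=200)   # spans lines 0-2
mark_lines(marks, offset=1280, size=128)  # exactly line 10
```

A single dead object now costs at most the lines it occupied, rather than a whole region. This is how small lines reduce false-marking fragmentation while blocks keep allocation contiguous.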
Allocation Policy (Recycling)
– Recycle partially marked blocks first
    Minimizes fragmentation
    Maximizes sharing of freed blocks
– Recycle in address order
    We explored other options
– Allocate into free blocks last
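The recycling order on this slide can be expressed as a function that lists holes in the order an allocator would consume them. Holes in partially marked blocks come first, in address order. Entirely free blocks come last. In this sketch, blocks are keyed by a hypothetical integer address and described by a list of line marks, and fully marked blocks are skipped. `allocation_order` is an illustrative name.

```python
from typing import Dict, List, Tuple


def allocation_order(block_marks: Dict[int, List[bool]]) -> List[Tuple[int, int, int]]:
    """Return holes as (block_address, first_line, n_lines) in consumption order:
    holes in partially marked blocks first, by address, then free blocks last."""
    recyclable: List[Tuple[int, int, int]] = []
    free: List[Tuple[int, int, int]] = []
    for addr in sorted(block_marks):  # address order
        marks = block_marks[addr]
        if not any(marks):
            free.append((addr, 0, len(marks)))  # whole block, used only after recycling
        elif not all(marks):
            i = 0
            while i < len(marks):
                if marks[i]:
                    i += 1
                    continue
                start = i
                while i < len(marks) and not marks[i]:
                    i += 1
                recyclable.append((addr, start, i - start))
        # fully marked blocks contribute no holes
    return recyclable + free


blocks = {
    0x2000: [False, False, False, False],  # free
    0x1000: [True, False, False, True],    # recyclable
    0x3000: [True, True, True, True],      # full
    0x0000: [False, True, True, False],    # recyclable
}
order = allocation_order(blocks)
```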
Opportunistic Defragmentation
– Identify source and target blocks
    (see paper for heuristics)
– Evacuate objects in source blocks
    Allocate into target blocks
– Opportunistic
    Leave in place if no space, or object pinned
– Opportunistically evacuate fragmented blocks
    Lightweight, uses same allocation mechanism
    No cost in common case (specialized GC)
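The opportunistic policy can be sketched as a pass over the live objects. Each unpinned object in a source block moves to the first target block with room; otherwise it stays where it is. This simplifies things in two ways. In a real collector, evacuation happens during the trace and leaves forwarding pointers behind. And the source and target selection heuristics from the paper are not reproduced here. `Obj`, `defragment`, and the `targets` free-space map are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(eq=False)
class Obj:
    name: str
    size: int
    block: int
    pinned: bool = False


def defragment(live: List[Obj], sources: Set[int], targets: Dict[int, int]) -> List[str]:
    """Evacuate live objects out of source blocks where possible.

    targets maps a target block to its free bytes. Returns the names of moved objects.
    """
    moved: List[str] = []
    for obj in live:
        if obj.block not in sources or obj.pinned:
            continue  # not in a fragmented block, or must not move
        for block, free in targets.items():
            if free >= obj.size:  # allocate into the first target with room
                targets[block] = free - obj.size
                obj.block = block
                moved.append(obj.name)
                break
        # If no target had room, the object simply stays in place (opportunism).
    return moved


objs = [Obj("a", 100, block=1), Obj("b", 100, block=1, pinned=True),
        Obj("c", 300, block=1), Obj("d", 50, block=2)]
targets = {5: 250}
moved = defragment(objs, sources={1}, targets=targets)
```

Only `a` moves. `b` is pinned, `c` does not fit in the remaining target space, and `d` is not in a source block.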
Other Optimizations
– Implicit marking
    ✓ Most objects are small
    Small objects implicitly mark the next line
    ✓ Very fast common case
    Large objects mark lines exactly
– Overflow allocation
    Multi-line objects may skip many small holes
    Overflow allocation (used on failure)
    ✓ Large objects are uncommon
    ✓ Very effective solution
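Both optimizations can be sketched on a line-mark array, assuming a hypothetical 128-byte line. `mark_object` marks only the first line of a small object, but marks every line of a larger one. `usable_holes` then conservatively skips the first free line after any marked line, since a small object may spill into it. `choose_allocator` shows the overflow decision: a multi-line object that does not fit the current hole goes to a separate overflow block, rather than skipping past many small holes. All three names are illustrative.

```python
from typing import List, Tuple

LINE_SIZE = 128  # hypothetical line size in bytes


def mark_object(line_marks: List[bool], offset: int, size: int) -> None:
    if size <= LINE_SIZE:
        # Small object: mark only the line it starts on. It may spill into the
        # next line, which usable_holes() treats as implicitly marked.
        line_marks[offset // LINE_SIZE] = True
    else:
        # Larger object: mark every line it touches exactly.
        for line in range(offset // LINE_SIZE, (offset + size - 1) // LINE_SIZE + 1):
            line_marks[line] = True


def usable_holes(line_marks: List[bool]) -> List[Tuple[int, int]]:
    """Runs of free lines, conservatively skipping the first free line after a marked line."""
    holes: List[Tuple[int, int]] = []
    i, n = 0, len(line_marks)
    while i < n:
        if line_marks[i]:
            i += 1
            if i < n and not line_marks[i]:
                i += 1  # implicitly marked line
            continue
        start = i
        while i < n and not line_marks[i]:
            i += 1
        holes.append((start, i - start))
    return holes


def choose_allocator(size: int, hole_bytes_left: int) -> str:
    if size <= hole_bytes_left:
        return "bump"      # fits in the current hole
    if size > LINE_SIZE:
        return "overflow"  # multi-line object: use a separate overflow block
    return "next-hole"     # small object: just move on to the next hole


marks = [False] * 8
mark_object(marks, offset=0, size=64)     # small: marks line 0 only
mark_object(marks, offset=512, size=200)  # larger: marks lines 4 and 5
```

The tradeoff is one possibly wasted line per marked run, in exchange for a single mark store on the common path of small objects.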
Results
Complete data available at:
Evaluation
– Benchmarks: DaCapo, SPECjvm98, SPEC jbb2000
– Collectors
    Full heap: Immix, MarkSweep, MarkCompact, SemiSpace
    Generational: GenIX, GenMS, GenCopy
    Sticky: StickyIX, StickyMS
– Methodology: MMTk; Jikes RVM (Perf ≈ HotSpot 1.5); replay compiler; discard outliers; report 95th percentile
– Hardware
    Core 2 Duo 2.4GHz, 32KB L1, 4MB L2, 2GB RAM
    AMD Athlon GHz, 64KB L1, 512KB L2, 2GB RAM
    PowerPC GHz, 32KB L1, 512KB L2, 2GB RAM
Please see the paper for details.
Mutator Time (geomean of DaCapo, jvm98, and jbb2000 on 2.4GHz Core 2 Duo)
Minimum Heap
GC Time (geomean of DaCapo, jvm98, and jbb2000 on 2.4GHz Core 2 Duo)
Total Performance (geomean of DaCapo, jvm98, and jbb2000 on 2.4GHz Core 2 Duo)
Generational Performance (geomean of DaCapo, jvm98, and jbb2000 on 2.4GHz Core 2 Duo)
Sticky Performance (geomean of DaCapo, jvm98, and jbb2000 on 2.4GHz Core 2 Duo)
PseudoJBB (2.4GHz Core 2 Duo)
Prior Work
– IBM product collector
    Mark-region not characterized
    Collector not evaluated
    Product and basis for other research [Domani et al. 2000] [Kermany & Petrank 2006]
Mark-Region Collection
– Reclamation: Sweep-to-Free, Compact, Evacuate, Sweep-to-Region
– Mark-Sweep: free-list + trace + sweep-to-free
– Mark-Compact: bump allocation + trace + compact
– Semi-Space: bump allocation + trace + evacuate
– Mark-Region: bump allocation + trace + sweep-to-region
Immix: Efficient Mark-Region Collection
– ✓ Simple, very fast collection
– ✓ Space efficient
– ✓ Good locality
– ✓ Excellent performance
Actual data, taken from geomean of DaCapo, jvm98, and jbb2000 on 2.4GHz Core 2 Duo
Open Source
Code available in JikesRVM onward.
Complete data available at:
Research History
– PLDI 1997: Clinger & Hansen postulated the radioactive decay model for object lifetimes
– Genesis of Older-First [Stefanovic, McKinley, Moss OOPSLA’99]
Garbage Collection Hypotheses
– Generational hypothesis: young objects die quickly, so collect them first
– Older-first hypothesis: the longer the collector waits, the more objects have died, so it has less to collect
(Figure: survival function s(v) of the object lifetime distribution over an age-ordered heap, from younger to older; axis marks 0, V/2, V)
Older-first Algorithm
Next Steps
– Beltway [BJMM PLDI’02]
    Increments
    Belts
    Combines generational and older-first
– Ulterior Reference Counting [BM OOPSLA’03]
    Reference counting on a per-object basis
    Responsiveness and throughput
– MMTk [BCM SIGMETRICS’04, ICSE’04]
    Toolkit for building & understanding GC
    Motivated today’s work
Garbage Collection is the Answer to All Your Problems
– Improves data and code locality [Huang et al. OOPSLA’02, ISMM’04, VEE’04]
– Cooperative GC optimizations
    Colocation [Guyer OOPSLA’05]
    Free-me [Guyer et al. PLDI’06]
– Finds leaks [Bond ASPLOS’06, Jump POPL’07]
– Tolerates leaks [Bond OOPSLA’08]
– Helps with dynamic software updating! [Subramanian, Hicks ??’08]
– DaCapo Benchmarks [Blackburn et al. OOPSLA’06, CACM’08]
380C: Where are we & where we are going
– Why you need to care about workloads
– Managed languages
    Dynamic compilation
    Inlining
    Garbage collection
– Opportunity to improve data locality on-the-fly
    Read: X. Huang, S. M. Blackburn, K. S. McKinley, J. E. B. Moss, Z. Wang, and P. Cheng, The Garbage Collection Advantage: Improving Program Locality, ACM Conference on Object Oriented Programming, Systems, Languages, and Applications (OOPSLA), pp , Vancouver, Canada, October
– Alias analysis
– Dependence analysis
– Loop transformations
– EDGE architectures