Portable, mostly-concurrent, mostly-copying GC for multi-processors. Tony Hosking, Secure Software Systems Lab, Purdue University
Platform assumptions: symmetric multi-processor (SMP/CMP); multiple mutator threads; (large heaps).
Desirable properties: maximize throughput; minimize collector pauses; scalability.
Exploiting parallelism: avoid contention; (mostly-)concurrent allocation; (mostly-)concurrent collection.
Concurrent allocation: Use thread-private allocation “pages”. Threads contend only for free pages. Each thread allocates from its own page: multiple small objects per page, or multiple pages per large object.
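The allocation scheme on this slide can be sketched as follows. This is a minimal model, not the Modula-3 run-time: the names `Heap`, `ThreadAllocator`, `take_pages` and the `PAGE_SIZE` value are all hypothetical, and addresses are modelled as `(page, offset)` pairs. The point it illustrates is that the shared lock is taken once per page, while individual object allocations touch only thread-private state.

```python
import threading

PAGE_SIZE = 4096  # bytes per page; illustrative value, not from the slides


class Heap:
    """Shared pool of free pages: the only point of contention."""

    def __init__(self, num_pages):
        self._lock = threading.Lock()
        self._free = list(range(num_pages))  # free page numbers

    def take_pages(self, count=1):
        # Threads contend here, but once per page rather than once per object.
        with self._lock:
            if len(self._free) < count:
                raise MemoryError("out of free pages")
            taken = self._free[:count]
            del self._free[:count]
            return taken


class ThreadAllocator:
    """Bump allocator over a page owned by a single thread."""

    def __init__(self, heap):
        self.heap = heap
        self.page = None
        self.offset = PAGE_SIZE  # forces a page fetch on the first allocation

    def alloc(self, size):
        if size > PAGE_SIZE:
            # Large object: several whole pages. A real heap would need them
            # to be contiguous; this model simply assumes they are.
            count = -(-size // PAGE_SIZE)  # ceiling division
            pages = self.heap.take_pages(count)
            return (pages[0], 0)
        if self.offset + size > PAGE_SIZE:
            # Current page is full: fetch a fresh private page.
            self.page = self.heap.take_pages(1)[0]
            self.offset = 0
        addr = (self.page, self.offset)
        self.offset += size  # no lock: no other thread allocates in this page
        return addr
```

Each mutator thread would own one `ThreadAllocator`; only `take_pages` is shared.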
Concurrent collection: the tricolour abstraction. Black: “live”, scanned, cannot refer to white. Grey: “live” wavefront, still to be scanned, may refer to any colour. White: hypothetical garbage.
Garbage collection: White = whole heap. Shade root targets grey. While grey is nonempty: shade one grey object black and shade its white children grey. At the end, white objects are garbage.
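The loop on this slide maps almost directly onto code. The sketch below is a stop-the-world model over an object graph given as a dictionary; the function name `mark` and the graph representation are assumptions made for illustration.

```python
def mark(roots, children):
    """Tricolour marking over a graph.

    children maps each object to the list of objects it references.
    Returns (black, white): the live set and the garbage set.
    """
    white = set(children)  # white = whole heap
    grey = []              # the wavefront still to be scanned
    black = set()

    def shade(obj):
        # Only white objects change colour; grey and black are left alone.
        if obj in white:
            white.remove(obj)
            grey.append(obj)

    for root in roots:     # shade root targets grey
        shade(root)

    while grey:            # while grey is nonempty
        obj = grey.pop()
        black.add(obj)     # shade one grey object black
        for child in children[obj]:
            shade(child)   # shade its white children grey

    return black, white    # whatever is still white is garbage
```

For example, with `heap = {"a": ["b"], "b": ["a", "c"], "c": [], "d": ["c"], "e": ["d"]}` and root `"a"`, objects `d` and `e` are never shaded and remain white.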
Copying collection: Partition white from black by copying. Reclaim the white partition wholesale. At the next GC, “flip” black to white.
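A minimal sketch of partitioning by copying follows. It uses a Cheney-style scan pointer, which is a choice made here for brevity and is not specified on the slides; `Obj`, `forward` and `collect` are hypothetical names. Copied-but-unscanned objects (at or beyond `scan`) play the role of grey, scanned objects are black, and anything never copied is white.

```python
class Obj:
    def __init__(self, name, fields=()):
        self.name = name
        self.fields = list(fields)
        self.forward = None  # forwarding pointer, set once the object is copied


def collect(roots):
    """Copy everything reachable from roots into a fresh to-space."""
    to_space = []

    def copy(obj):
        if obj.forward is None:
            # The clone's fields still point into from-space until scanned.
            clone = Obj(obj.name, obj.fields)
            obj.forward = clone
            to_space.append(clone)
        return obj.forward

    new_roots = [copy(r) for r in roots]

    scan = 0
    while scan < len(to_space):
        obj = to_space[scan]                        # next grey object
        obj.fields = [copy(f) for f in obj.fields]  # copy its white children
        scan += 1                                   # obj is now black

    # From-space (white) can now be reclaimed wholesale; at the next
    # collection, to_space becomes the new from-space ("flip").
    return new_roots, to_space
```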
Incremental collection [figure: mutator threads]
Concurrent collection [figure: mutator threads and a background GC thread]
Concurrent mutators: Mutation changes reachability during GC. Loss of a black/grey reference is safe: a non-white object losing its last reference will be garbage at the next GC. A new reference from black to white is not safe: the new reference may make the target live, yet the collector may never see it. Mutations may therefore require compensation.
Compensation options: (1) Prevent the mutator from creating black-to-white references: write barrier on black, or read barrier on grey to prevent the mutator obtaining white refs. (2) Prevent destruction of any path from a grey object to a white object without telling the GC: write barrier on grey.
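To make the hazard concrete, the sketch below models an incremental collector and a mutator that moves the only reference to a white object from a grey object into a black one. The barrier shown is one possible form of "write barrier on black" (shade the stored target grey); other forms exist, and all names here (`Node`, `Collector`, `lost_object_demo`) are hypothetical.

```python
WHITE, GREY, BLACK = "white", "grey", "black"


class Node:
    def __init__(self, name):
        self.name = name
        self.colour = WHITE
        self.fields = {}


class Collector:
    def __init__(self):
        self.grey = []

    def shade(self, obj):
        if obj.colour == WHITE:
            obj.colour = GREY
            self.grey.append(obj)

    def step(self):
        """One collector increment: blacken one grey object."""
        obj = self.grey.pop()
        for child in obj.fields.values():
            self.shade(child)
        obj.colour = BLACK

    def write(self, src, field, target, barrier=True):
        """Mutator store src.field = target, optionally with a barrier."""
        if barrier and src.colour == BLACK:
            self.shade(target)  # never leave a black-to-white reference
        src.fields[field] = target


def lost_object_demo(barrier):
    gc = Collector()
    a, b, c = Node("a"), Node("b"), Node("c")
    a.fields["b"] = b
    b.fields["c"] = c
    gc.shade(a)
    gc.step()                     # a is black, b is grey, c is white
    gc.write(a, "c", c, barrier)  # new reference from black a to white c
    del b.fields["c"]             # destroy the only grey-to-white path
    while gc.grey:
        gc.step()
    return c.colour               # c is still reachable from a
```

Without the barrier, `c` ends the cycle white although it is reachable, so it would be reclaimed in error; with the barrier it is shaded and survives.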
Mostly-copying GC [Bartlett]: Copying collection with ambiguous roots (uncooperative compilers, untidy references, explicit pinning). Pin ambiguously-referenced objects: shade their page grey without copying. Assume heap accuracy: copy the remaining heap-referenced objects.
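The pinning step can be sketched as a conservative scan: every word found in the stacks or registers that could be a pointer into an allocated page pins that page. The function name, the `PAGE_SIZE` value and the integer model of addresses below are assumptions for illustration only.

```python
PAGE_SIZE = 4096  # illustrative value, not from the slides


def pinned_pages(ambiguous_words, allocated_pages):
    """Return the pages that must stay in place during a collection.

    ambiguous_words: integers found in stacks/registers that may or may
                     not be pointers (the collector cannot tell).
    allocated_pages: set of page numbers currently holding objects.
    """
    pinned = set()
    for word in ambiguous_words:
        page = word // PAGE_SIZE
        if page in allocated_pages:
            # Might be a pointer: the word cannot be updated safely, so the
            # page is shaded grey in place instead of copying its objects.
            pinned.add(page)
    return pinned
```

Objects on pages outside the returned set are reached only through accurate heap references and can be copied as usual.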
Incremental MCGC [DeTreville]: Enforce grey mutator invariant: STW greys ambiguously-referenced pages; read barrier on grey using VM page protection. Read barrier: stop mutator threads; unprotect page; copy white targets to grey; shade page black; restart threads. Atomic system call wrappers unprotect parameter targets (otherwise traps in the OS return an error).
Concurrent MCGC? Stopping all threads at each increment is prohibitive on SMP and impedes concurrency, BUT barriers are difficult to place on ambiguous references with uncooperative compilers; ALSO, preemptive scheduling may break wrapper atomicity.
Mostly-concurrent MCGC: Enforce black mutator invariant: STW blackens ambiguously-referenced pages; read barrier on load of an accurate (tidy) grey reference. Read barrier: blacken grey references as they are loaded. No system call wrappers: arguments are always black.
Read barrier on load of grey: Object header bit marks grey objects. Inline fast path checks the grey bit in the target header and calls out to the slow path if it is set. Out-of-line slow path: lock heap meta-data; for each (grey) source object in the target page, copy its white targets to grey and clear its grey header bit; shade the target page black; unlock heap meta-data.
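The fast and slow paths can be modelled as below. This is a single-process sketch under several assumptions: the class and method names (`Obj`, `Page`, `Heap`, `load`, `clean`, `copy`) are hypothetical, copies always go to one current grey page, and memory-ordering effects on real hardware are not modelled. Retiring the current grey page before cleaning it is a detail added here so that fresh copies never land on a page that is about to turn black.

```python
import threading


class Page:
    def __init__(self, colour):
        self.colour = colour  # "white", "grey" or "black"
        self.objects = []


class Obj:
    def __init__(self, page, fields=()):
        self.page = page
        self.fields = list(fields)
        self.grey = False     # the grey header bit
        self.forward = None   # forwarding pointer once copied


class Heap:
    def __init__(self):
        self.lock = threading.Lock()  # guards heap meta-data
        self.grey_page = Page("grey")  # destination for new copies

    def copy(self, obj):
        """Copy a white object to the current grey page (idempotent)."""
        if obj.page.colour != "white":
            return obj
        if obj.forward is None:
            clone = Obj(self.grey_page, obj.fields)
            clone.grey = True          # grey bits are set only in new copies
            self.grey_page.objects.append(clone)
            obj.forward = clone
        return obj.forward

    def load(self, src, index):
        """Mutator load of src.fields[index] with the read barrier."""
        target = src.fields[index]
        if target.grey:                # inline fast path: test header bit
            self.clean(target)         # out-of-line slow path
        return target                  # always black on return

    def clean(self, target):
        with self.lock:                # lock heap meta-data
            page = target.page
            if page.colour == "black":
                return                 # another thread already cleaned it
            if page is self.grey_page:
                self.grey_page = Page("grey")  # new copies go elsewhere
            for obj in page.objects:   # each grey source object in the page
                if obj.grey:
                    obj.fields = [self.copy(f) for f in obj.fields]
                    obj.grey = False   # clear grey header bit
            page.colour = "black"      # shade target page black
```

After `load` returns, the mutator holds a reference to an object on a black page whose fields refer only to grey or black objects, which is the black mutator invariant from the previous slide.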
Coherence for fast path: STW phase synchronizes mutators’ views of heap state. Grey bits are set only in newly-copied objects (i.e., newly-allocated grey pages) since the most recent STW. Mutators can never see a cleared grey header unless the page is also black. Seeing a spurious grey header due to weak ordering is benign: the slow path will synchronize.
Implementation: Modula-3 with gcc-based compiler back-end; no tricky target-specific stack-maps; compiler front-end emits barriers. M3 threads map to preemptively-scheduled POSIX pthreads. Stop/start threads: signals + semaphores, or OS primitives if available. Simple to port: Darwin (OS X), Linux, Solaris, Alpha/OSF.
Experiments: Parallelized GCOld benchmark to permit throughput measurements for multiple mutators; measures steady-state GC throughput. 2 platforms: 2 x 2.3GHz PowerPC Macintosh Xserve running OS X; 8 x 700MHz Intel Pentium 3 SMP running Linux 2.6.
Read Barriers: STW 1 user-level mutator thread, work=1
Elapsed time (s) 1 system-level mutator thread, work=1
Heap size 1 system-level mutator thread
BMU 1 system-level mutator thread, work=1000, ratio=1
Scalability work=1000, ratio=1, 8xP3
Java Hotspot server work=1000, 8xP3
Conclusions: Mostly-concurrent, mostly-copying collection is feasible for multi-processors (proof-of-existence). Performance is good (scalable). Portable: changes only to the compiler front-end to introduce barriers, and to the GC run-time system. Compiler back-end unchanged: full-blown optimizations enabled, no stack-map overheads.
Future work: Convert the read barrier to “clean” only the target object instead of the whole page.
BMU 1 system-level mutator thread, work=10, ratio=1
Scalability work=10, ratio=1, 8xP3
Java Hotspot server work=10, 8xP3