A Parallel, Real-Time Garbage Collector
Authors: Perry Cheng, Guy E. Blelloch
Presenter: Jun Tao
Outline
– Introduction
– Background and definitions
– Theoretical algorithm
– Extended algorithm
– Evaluation
– Conclusion
Introduction
– First garbage collectors: non-incremental, non-parallel
– Recent collectors: incremental, concurrent, and parallel
Introduction
A scalably parallel and real-time collector
– All aspects of the collector are incremental
– Parallel: supports an arbitrary number of application and collector threads
– Tight theoretical bounds on the pause time (for any application) and on total memory usage
– Asymptotically, but not practically, efficient
Introduction
The extended collector algorithm
– Works with generations
– Increases the granularity of the incremental steps
– Handles global variables separately
– Delays the copy on a write
– Reduces the synchronization cost of copying small objects
– Parallelizes the processing of large objects
– Reduces double allocation during collection
– Allows program stacks
Background and Definitions
A semispace stop-copy collector
– Divides heap memory into two equally sized regions: from-space and to-space
– When from-space is full, suspends the mutator and copies the reachable objects into to-space
– Updates root values and reverses the roles of from-space and to-space (a sketch of the copy step follows below)
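A minimal sketch of the copy step, assuming a hypothetical object layout in which the first word holds the field count until it is overwritten by a forwarding mark (the paper's actual tag encoding differs):

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical layout: a size word followed by that many pointer
     * fields. After copying, the size word becomes the FORWARDED
     * sentinel and fields[0] points at the to-space replica. */
    #define FORWARDED ((size_t)-1)

    typedef struct Obj {
        size_t nfields;
        struct Obj *fields[];
    } Obj;

    static char *to_next;   /* allocation frontier in to-space */

    /* Copy one object into to-space, leaving a forwarding pointer so
     * later references find the replica instead of copying it again. */
    static Obj *forward(Obj *o) {
        if (o->nfields == FORWARDED)
            return o->fields[0];
        size_t bytes = sizeof(Obj) + o->nfields * sizeof(Obj *);
        Obj *copy = (Obj *)to_next;
        to_next += bytes;
        memcpy(copy, o, bytes);
        o->nfields = FORWARDED;
        o->fields[0] = copy;
        return copy;
    }

Collection starts by forwarding the roots and then scans to-space left to right, forwarding every field of every copied object until the scan pointer catches up with the allocation frontier.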
Background and Definitions
Types of Garbage Collectors
Background and Definitions
Types of Garbage Collectors (continued)
Background and Definitions
Real-time collector metrics
– Maximum pause time
– Utilization: the fraction of time during which the mutator executes
– Minimum mutator utilization (MMU)
  A function of the window size: the minimum utilization over all windows of that size (formalized below)
  Equals 0 when the window size is at most the maximum pause time
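Formally, with the utilization of a window defined as the fraction of that window during which the mutator runs, the minimum mutator utilization is (the notation here follows common usage rather than the slide):

\[
\mathrm{MMU}(w) \;=\; \min_{t}\; \frac{\text{mutator time in } [t,\, t+w]}{w}
\]

A nonzero MMU at window size w guarantees that the mutator makes progress in every interval of length w, which is strictly stronger than a bound on the maximum pause time alone.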
Theoretical Algorithm
A parallel, incremental, and concurrent collector
– Based on Cheney's simple copying collector
– All objects are stored in a shared global pool of memory
– Relies on two atomic instructions, FetchAndAdd and CompareAndSwap (sketched below)
– The collector interfaces with the application at three points: allocating space for a new object, initializing the fields of a new object, and modifying a field of an existing object
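A sketch of the two primitives using GCC/Clang atomic builtins; the paper assumes hardware equivalents, and the wrapper names are illustrative:

    #include <stdint.h>

    /* Atomically add n to *p and return the old value. Used, e.g.,
     * to claim a region of to-space or of the shared stack. */
    static inline intptr_t fetch_and_add(volatile intptr_t *p, intptr_t n) {
        return __atomic_fetch_add(p, n, __ATOMIC_SEQ_CST);
    }

    /* Atomically replace *p with desired only if it still equals
     * expected; returns nonzero on success. Used to gain exclusive
     * access to an object being copied. */
    static inline int compare_and_swap(volatile intptr_t *p,
                                       intptr_t expected, intptr_t desired) {
        return __atomic_compare_exchange_n(p, &expected, desired, 0,
                                           __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }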
Theoretical Algorithm
Scalable parallelism
– The collector must maintain the set of gray objects
– Cheney's technique: keep them in contiguous locations in to-space
  Pros: simple
  Cons: restricts the traversal order to breadth-first; difficult to implement in a parallel setting
Theoretical Algorithm
Scalable parallelism (continued)
– Explicitly managed local stacks instead
  Each processor maintains a local stack of gray objects
  A shared stack of gray objects is kept as well
  Gray objects are periodically transferred between the local and shared stacks, which avoids idleness
– Pushes (or pops) can proceed in parallel: each processor reserves a target region on the shared stack before transferring (sketched below)
– Pushes and pops are never concurrent: room synchronization
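A sketch of a parallel transfer to the shared stack: one FetchAndAdd reserves a block of slots, after which the copy into the reserved region needs no further synchronization. All names and sizes are illustrative:

    #include <stdint.h>

    #define SHARED_CAP (1 << 20)

    typedef struct {
        void *slots[SHARED_CAP];
        volatile intptr_t top;    /* next free slot */
    } SharedStack;

    static void push_shared(SharedStack *s, void **local, int n) {
        /* Reserve n slots; [base, base + n) now belongs to us alone. */
        intptr_t base = __atomic_fetch_add(&s->top, (intptr_t)n,
                                           __ATOMIC_SEQ_CST);
        for (int i = 0; i < n; i++)
            s->slots[base + i] = local[i];
    }

Pops work symmetrically by subtracting from top, which is why pushes and pops must be confined to separate phases: the rooms of the following slides.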
Theoretical Algorithm
Scalable parallelism (continued)
– A white object must not be copied twice
  Atomic instructions give one collector thread exclusive access to the object
  Copy-copy synchronization (sketched below)
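A sketch of copy-copy synchronization: a collector thread claims a white object by swinging its header to a BUSY mark with CompareAndSwap, and the loser of the race waits for the winner to publish the forwarding pointer. The header encoding and helper names are illustrative, not the paper's exact scheme:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Header encoding (illustrative):
     *   count << 1      white object (low bit clear)
     *   BUSY            copy in progress
     *   (ptr | 1)       forwarded (low bit set)                 */
    #define BUSY    ((intptr_t)-1)
    #define FWD_TAG ((intptr_t)1)

    typedef struct Obj {
        volatile intptr_t header;
        struct Obj *fields[];
    } Obj;

    extern char *to_alloc(size_t bytes);  /* hypothetical to-space bump allocator */

    static Obj *claim_and_copy(Obj *o) {
        for (;;) {
            intptr_t h = o->header;
            if (h == BUSY)                      /* copy in flight: spin */
                continue;
            if (h & FWD_TAG)                    /* already copied */
                return (Obj *)(h & ~FWD_TAG);
            /* Try to claim; failure means another thread won the race. */
            if (__atomic_compare_exchange_n(&o->header, &h, BUSY, 0,
                                            __ATOMIC_SEQ_CST,
                                            __ATOMIC_SEQ_CST)) {
                size_t n = (size_t)(h >> 1);
                Obj *copy = (Obj *)to_alloc(sizeof(Obj) + n * sizeof(Obj *));
                copy->header = h;
                memcpy(copy->fields, (void *)o->fields, n * sizeof(Obj *));
                o->header = (intptr_t)copy | FWD_TAG;  /* publish replica */
                return copy;
            }
        }
    }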
Theoretical Algorithm
Incremental and replicating collection
– Baker's incremental collector: copy k units of data for every unit of data allocated, which bounds the pause time
  The mutator may only see copied objects in to-space, so a read barrier is needed
– A modification avoids the read barrier
  The mutator sees only the original objects in from-space, so a write barrier is needed instead (sketched below)
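A sketch of the resulting write barrier, reusing the header encoding from the previous sketch: the mutator writes to the from-space original as usual and mirrors the write into the replica if one already exists, so the replica never goes stale:

    /* Mutator write barrier for replicating collection (illustrative). */
    static void write_field(Obj *o, size_t i, Obj *value) {
        o->fields[i] = value;                /* normal from-space write */
        intptr_t h = o->header;
        if (h != BUSY && (h & FWD_TAG)) {    /* replica exists: mirror */
            Obj *replica = (Obj *)(h & ~FWD_TAG);
            replica->fields[i] = value;
        }
        /* If h == BUSY, the copy is in flight; the write must be logged
         * and applied later (the copy-write synchronization of the
         * next slide). */
    }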
Theoretical Algorithm
Concurrency
– The program and the collector execute simultaneously
  The program manipulates the primary memory graph; the collector manipulates the replica graph
– Copy-write synchronization is needed
  Replica objects must be updated to match the mutator's writes
  To avoid race conditions, objects being copied are marked, and the mutator's updates to the replica are delayed
– Write-write synchronization is needed
  Prohibits different mutator threads from modifying the same memory location concurrently
Theoretical Algorithm
Space and time bounds
– Time bound on each memory operation: ck
  c: a constant
  k: the number of words collected per word allocated
– Space bound: 2(R(1 + 1.5/k) + N + 5PD) ≈ 2R(1 + 1.5/k)
  R: reachable space
  N: maximum object count
  P: number of processors on a P-way multiprocessor
  D: maximum depth of the memory graph
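Plugging k = 2 into the slide's approximate bound gives a concrete trade-off:

\[
2R\left(1 + \frac{1.5}{2}\right) = 3.5R
\]

That is, copying two words per word allocated keeps total memory within 3.5 times the reachable space (the approximation drops the N and 5PD terms), while each memory operation costs at most 2c time.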
Extended Algorithm
Globals, stacks, and stacklets
– Globals
  Naively updated when collection ends, but there can be arbitrarily many globals, giving an unbounded pause time
  Instead, globals are replicated like other heap objects: every global has two locations, and a single flag selects between them for all globals
– Stacks and stacklets
  Stacks are divided into fixed-size stacklets (sketched below)
  At most one stacklet is active, so the others can be replicated safely
  Stacklets also bound the wasted space per stack
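A sketch of a stacklet representation under these assumptions: a thread's stack is a chain of fixed-size chunks, and only the topmost chunk may be mutated, so the collector can replicate the inactive ones concurrently. Sizes and names are illustrative:

    #include <stdint.h>

    #define STACKLET_WORDS 1024

    typedef struct Stacklet {
        struct Stacklet *prev;            /* next-older stacklet in the chain */
        int active;                       /* nonzero only for the top chunk */
        intptr_t slots[STACKLET_WORDS];   /* frame data */
    } Stacklet;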
Extended Algorithm
Granularity
– Block allocation and free initialization
  Avoids calling FetchAndAdd for every memory allocation
  Each processor maintains a local pool in from-space, and another local pool in to-space while the collector is on
  A single FetchAndAdd is used only when allocating a local pool (sketched below)
– Write barrier
  Avoids updating copied objects on every write
  Records a triple in a write log and defers the update
  The collector is invoked when the write log is full, eliminating frequent context switches
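A sketch of block allocation: each processor grabs a whole pool from the shared frontier with one FetchAndAdd and then bump-allocates from it with no synchronization at all. Names and sizes are illustrative:

    #include <stdint.h>
    #include <stddef.h>

    #define POOL_BYTES (32 * 1024)

    static volatile intptr_t frontier;             /* shared space frontier */

    typedef struct { char *next, *limit; } Pool;   /* per-processor pool */

    static void *alloc(Pool *p, size_t bytes) {
        if (p->next + bytes > p->limit) {          /* pool exhausted: refill */
            p->next  = (char *)__atomic_fetch_add(&frontier, POOL_BYTES,
                                                  __ATOMIC_SEQ_CST);
            p->limit = p->next + POOL_BYTES;
        }
        void *obj = p->next;                       /* synchronization-free bump */
        p->next += bytes;
        return obj;
    }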
Extended Algorithm
Small and large objects
– Original algorithm: one field at a time
  Requires reinterpreting the tag word and transferring the object to and from the local stack for each field
– Extended algorithm
  Small objects: locked down and copied all at once
  Large objects: divided into segments and copied one segment at a time (sketched below)
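A sketch of large-object segmentation, reusing the earlier object layout: the gray set holds (object, segment index) work units, so several collector threads can copy different segments of the same large object in parallel. The segment size is illustrative:

    #define SEGMENT_WORDS 512

    typedef struct { Obj *obj; size_t seg; } WorkUnit;

    /* Copy just one segment of a large object into its replica. */
    static void copy_segment(WorkUnit w, Obj *replica) {
        size_t n     = (size_t)(replica->header >> 1);  /* total field count */
        size_t first = w.seg * SEGMENT_WORDS;
        size_t last  = first + SEGMENT_WORDS;
        if (last > n) last = n;
        for (size_t i = first; i < last; i++)
            replica->fields[i] = w.obj->fields[i];
    }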
Extended Algorithm
Algorithmic modifications
– Reducing double allocation
  During collection each new object is allocated twice: once by the mutator and once by the collector
  The second allocation is deferred
– Rooms and better rooms
  A push room and a pop room; only one room can be non-empty
  Rooms: enter the pop room, fetch work and perform it, transition to the push room, and push new gray objects back onto the shared stack
  But graying objects is time-consuming, so threads wait to enter the push room
Extended Algorithm
Algorithmic modifications (continued)
– Rooms and better rooms (continued)
  Better rooms: leave the pop room right after fetching work from the shared stack (a sketch of room synchronization follows below)
  Emptiness of the shared stack is detected by maintaining a borrow counter
– Generational collection
  A nursery and a tenured space
  A minor collection is triggered when the nursery is full; a major collection when the tenured space is full
  Tenured references must not be modified during collection, so each mutable pointer holds two fields: one for the mutator to use, the other for the collector to update
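A sketch of room synchronization under these assumptions: threads in the pop room may pop from the shared stack and threads in the push room may push, and the two rooms are never occupied at the same time. This spin-based version is an illustration, not the paper's exact protocol:

    #include <stdint.h>

    typedef struct {
        volatile int occupants[2];   /* threads currently in each room */
        volatile int open;           /* which room admits new entrants */
    } Rooms;

    static void enter_room(Rooms *r, int room) {
        for (;;) {
            if (r->open != room)
                continue;
            __atomic_fetch_add(&r->occupants[room], 1, __ATOMIC_SEQ_CST);
            if (r->open == room)     /* still open: entry is valid */
                return;
            /* The room closed while we were entering: back out, retry. */
            __atomic_fetch_add(&r->occupants[room], -1, __ATOMIC_SEQ_CST);
        }
    }

    static void exit_room(Rooms *r, int room) {
        /* The last thread out opens the other room. */
        if (__atomic_fetch_add(&r->occupants[room], -1, __ATOMIC_SEQ_CST) == 1)
            r->open = 1 - room;
    }

The better-rooms refinement lets a thread leave the pop room as soon as it has fetched its work, shrinking the window during which pushers must wait.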
Evaluation
Conclusion
– Implements a scalably parallel, concurrent, real-time garbage collector
– Thread synchronization is minimized