2
Cache-Conscious Copying Collectors
Written by: Henry G. Baker
Presented by: Eliaz Tobias
3
Motivation
• Garbage collectors must minimize:
– Cache space
– Off-chip communication bandwidth
• Performance optimization on modern single-chip computer architectures
4
Introduction
• Modern processor chips are no longer “CPU-bound” but “I/O-bound”.
• Example: multiplying large matrices [LAM 91]. Performance increased from 0.9 MFLOPS to 4 MFLOPS through careful management of the scarce resources (cache space and off-chip bandwidth).
5
Introduction - Cont
• Numeric processing: I/O-bound. There is a need to use ‘prefetching’.
• Symbolic processing: also more I/O-bound than CPU-bound.
6
Introduction - Cont
• The control structures of symbolic processing tend to be data-dependent. This limits the potential of prefetching strategies.
• The prefetching strategy isn’t appropriate for today’s bandwidth-limited processors.
• Using GC can help us optimize the space and bandwidth management strategies.
7
What are we going to see?
• Cache categories
• History
• Advantages of copying garbage collectors (CGCs) and non-copying garbage collectors (NCGCs)
• Strategies to overcome the drawbacks of copying garbage collectors
• Analysis of a parallel GC system
• Case study: the Intel 80860XP architecture
• Summary
8
Cache Categories
• Fully associative: a block can be mapped anywhere.
• Direct mapped: a block can be mapped to exactly one cache line.
• n-way set associative: a block can be mapped into 1 set (of n lines) by the formula: (Block address) MOD (Number of sets in cache)
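The mapping formula can be checked with a few lines of Python. This is only a minimal sketch: the function name `cache_set_index`, the 8-line cache, and block address 13 are illustrative assumptions, not values from the presentation. Direct mapped is treated as the 1-way case and fully associative as a single set that holds every line.

```python
def cache_set_index(block_address: int, num_lines: int, ways: int) -> int:
    """Return the set a block maps to: (block address) MOD (number of sets)."""
    if ways <= 0 or num_lines % ways != 0:
        raise ValueError("ways must be positive and divide the number of lines")
    num_sets = num_lines // ways
    return block_address % num_sets


NUM_LINES = 8   # hypothetical tiny cache, chosen only for illustration
BLOCK = 13      # hypothetical block address

direct_mapped = cache_set_index(BLOCK, NUM_LINES, ways=1)              # 8 sets of 1 line
two_way = cache_set_index(BLOCK, NUM_LINES, ways=2)                    # 4 sets of 2 lines
fully_associative = cache_set_index(BLOCK, NUM_LINES, ways=NUM_LINES)  # 1 set of 8 lines

print(direct_mapped, two_way, fully_associative)
```

With one set, every block lands in set 0, which is another way of saying the block can go anywhere in the cache.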
9
History
• Minsky 1963: the 1st copying garbage collector
• Cheney 1970: the elegant 2-space model
• Baker 1978: real-time version of the Cheney algorithm
10
Non-Copying Garbage Collectors - Advantages
• Less address space required
– Objects never occupy 2 different locations at the same time
• Doesn’t move objects behind the compiler’s back
– Compiler optimizations are not invalidated
• The smaller object space and the lack of object motion optimize performance on cache-based architectures.
11
Copying Garbage Collectors - Advantages
• Trivial allocation of non-homogeneous objects, due to the compactness of the free area.
• Can also be used to improve locality for virtual memory / cache purposes.
• Can expand or contract the amount of space under management more easily.
• Single-phase algorithm.
• Simplicity of storage management under widely varying demands.
12
The Cheney algorithm
• Reminder
– The Cheney algorithm uses a from-space and a to-space.
– When the from-space fills up, a “stop & copy” is performed: all live objects are moved to the to-space.
– The copy uses 2 pointers in the to-space: the scan pointer and the allocate pointer.
– After the copy, the to-space and the from-space swap roles.
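The reminder above can be sketched as a toy stop-and-copy collector. This is an illustration under stated assumptions, not Baker's or Cheney's code: the heap is a Python list, objects are lists of fields, and the names `Ref`, `Forward`, and `cheney_collect` are hypothetical. The length of `to_space` plays the role of the allocate pointer, and `scan` is the scan pointer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """A pointer: the index of an object within a space."""
    index: int


@dataclass(frozen=True)
class Forward:
    """Forwarding pointer left in from-space once an object has been copied."""
    to_index: int


def cheney_collect(from_space, roots):
    """Copy everything reachable from `roots` into a fresh to-space."""
    to_space = []  # len(to_space) acts as the allocate pointer

    def copy(ref):
        obj = from_space[ref.index]
        if isinstance(obj, Forward):          # already copied: reuse that copy
            return Ref(obj.to_index)
        to_space.append(list(obj))            # allocate in to-space and copy fields
        new_index = len(to_space) - 1
        from_space[ref.index] = Forward(new_index)
        return Ref(new_index)

    new_roots = [copy(root) for root in roots]
    scan = 0                                  # the scan pointer
    while scan < len(to_space):               # done when scan catches up with allocate
        obj = to_space[scan]
        for i, field in enumerate(obj):
            if isinstance(field, Ref):
                obj[i] = copy(field)          # redirect the field into to-space
        scan += 1
    return to_space, new_roots


# Object 0 points at objects 2 and 3, which form a cycle; object 1 is garbage.
heap = [
    ["a", Ref(2), Ref(3)],
    ["garbage"],
    ["b", Ref(3)],
    ["c", Ref(2)],
]
to_space, roots = cheney_collect(heap, [Ref(0)])
print(to_space)
```

The garbage object is never visited, and the cycle between the last two objects is preserved because the second visit finds a forwarding pointer instead of copying again.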
13
Let’s overcome the Drawbacks of the Copying Garbage Collectors
14
The methods
• CDR-Coding
• Large objects are not copied
• The Ghosting Argument
• Forwarding pointers
• Forwarding pointers - Pipeline
• Parallel Garbage Collection
15
CDR-Coding
• CDR-coding is a space-saving way to store lists in memory.
• Usually used in Lisp.
• The idea: we don’t hold 2 pointers for each element of the list. Instead, we hold a value and a 2-bit "CDR code". The CDR code may contain one of three values: CDR-NORMAL, CDR-NEXT, and CDR-NIL.
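A small sketch may make the encoding concrete. It assumes a simplified reading of the three codes: CDR-NEXT means the rest of the list starts in the next cell, CDR-NIL means the list ends here, and CDR-NORMAL means the following cell holds the index of the cell where the rest of the list starts. The cell layout and the names `cdr_encode` and `cdr_decode` are hypothetical, and only non-empty lists are handled.

```python
CDR_NORMAL, CDR_NEXT, CDR_NIL = 0, 1, 2   # three values, so a 2-bit code suffices


def cdr_encode(items):
    """Store a non-empty list as consecutive (value, cdr_code) cells."""
    if not items:
        raise ValueError("this sketch only encodes non-empty lists")
    last = len(items) - 1
    return [(value, CDR_NIL if i == last else CDR_NEXT)
            for i, value in enumerate(items)]


def cdr_decode(cells, start=0):
    """Walk the cells from `start` and rebuild the list of values."""
    values = []
    i = start
    while True:
        value, code = cells[i]
        values.append(value)
        if code == CDR_NIL:
            return values
        if code == CDR_NEXT:
            i += 1                    # the rest of the list is the next cell
        else:                         # CDR_NORMAL (assumed layout, see lead-in)
            i = cells[i + 1][0]       # next cell holds the index of the rest


compact = cdr_encode([1, 2, 3])       # 3 cells instead of 3 pairs of pointers
# A list whose tail lives elsewhere: cell 0 uses CDR_NORMAL, cell 1 holds index 3.
linked = [(1, CDR_NORMAL), (3, None), ("unrelated", CDR_NIL), (2, CDR_NIL)]
print(cdr_decode(compact), cdr_decode(linked))
```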
16
Large objects are not copied
• The idea
– Modify the page map instead
• What for?
– Saving copying overhead
– Saving physical address space
17
Large objects are not copied
• What do we need for this?
– System support for page maps that alias pages into a different virtual address space
– Large objects should not share pages with other objects
• Improvement
– In a properly designed system, the aliased cache locations are accessible without reloading
– This avoids unnecessary overhead
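The page-map idea can be sketched with a dictionary standing in for the page table. Everything here is a toy assumption: the 4-word page size, the names `remap_large_object` and `load`, and the choice to unmap the from-space pages rather than keep both aliases alive. The point it illustrates is that the physical memory is never touched when a large object "moves".

```python
PAGE_SIZE = 4  # words per page; hypothetical toy value


def remap_large_object(page_map, from_page, to_page, num_pages):
    """'Move' an object by pointing to-space pages at its physical frames."""
    for offset in range(num_pages):
        frame = page_map.pop(from_page + offset)   # drop the from-space mapping
        page_map[to_page + offset] = frame         # same frame, new virtual page


def load(memory, page_map, virtual_address):
    """Translate a virtual word address through the page map and read it."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    return memory[page_map[page] * PAGE_SIZE + offset]


memory = list(range(100, 116))        # 4 physical frames of 4 words each
page_map = {0: 2, 1: 3}               # a 2-page object at virtual pages 0 and 1
snapshot = list(memory)

before = load(memory, page_map, 5)    # word 5 of the object, via from-space
remap_large_object(page_map, from_page=0, to_page=10, num_pages=2)
after = load(memory, page_map, 10 * PAGE_SIZE + 5)   # same word, via to-space
print(before, after, memory == snapshot)
```

This only works if the object owns its pages outright, which is why large objects should not share pages with other objects.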
18
The Ghosting Argument
Let’s examine the following case:
– Cheney’s algorithm is run twice in succession on exactly the same data.
– Remember that on the 2nd run, from-space and to-space switch places.
– When an object is copied to to-space on the second run, the location accessed by the allocate pointer in to-space is at the same relative position as the location being read in from-space.
19
The Ghosting Argument (Cont)
• There is a ghost/shadow of the allocate pointer in the from-space.
• As a result, the page/cache line of the allocate pointer will almost certainly be in memory.
– Minimal cache misses
• The ghosting argument works well even for other copying garbage collectors.
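The ghosting argument can be observed in a toy simulation. The sketch below is an assumption-laden model, not the presenter's code: objects are lists whose integer fields are pointers (object indices) and whose string fields are plain data, forwarding pointers live in a side dictionary, and `cheney_trace` is a hypothetical name. It records, for every copied object, the from-space index read and the to-space index written.

```python
def cheney_trace(from_space, root):
    """Cheney copy over a toy heap; returns (to_space, trace)."""
    to_space, forward, trace = [], {}, []

    def copy(index):
        if index not in forward:
            forward[index] = len(to_space)            # current allocate pointer
            to_space.append(list(from_space[index]))  # copy the object
            trace.append((index, forward[index]))     # (read at, written at)
        return forward[index]

    copy(root)
    scan = 0
    while scan < len(to_space):
        obj = to_space[scan]
        for i, field in enumerate(obj):
            if isinstance(field, int):                # integer fields are pointers
                obj[i] = copy(field)
        scan += 1
    return to_space, trace


# A scrambled heap: the root is object 3 and object 1 is garbage.
heap = [["d"], ["garbage"], ["b", 4, 0], ["root", 2, 5], ["c"], ["e", 0]]

first_space, first_trace = cheney_trace(heap, root=3)
second_space, second_trace = cheney_trace(first_space, root=0)  # spaces flipped

print(first_trace)
print(second_trace)
```

On the first run the reads jump around from-space. On the second run every object is read from the same relative position it is written to, so the allocate pointer is shadowed by the location being read in from-space.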
20
Forwarding pointers
• What for?
– Forwarding pointers keep the ‘from-space’ and the ‘to-space’ isomorphic after the copy (‘object identity’).
• What if the GC doesn’t use them?
– The ‘to-space’ will consist of objects with only one reference
– Problem: cyclic list structures
21
Forwarding pointers
• Assumptions
– Shared objects with more than one reference are much less predictable.
– The GC needs a way to identify those relatively few shared objects.
– Based on intuition and experience, objects with 1 reference are known to be functional objects.
– The reason for preserving forwarding pointers does not apply to these objects.
22
Forwarding pointers
• Conclusion 1: we can get rid of the forwarding pointers for the functional objects.
• Problem: in some cases functional objects have more than 1 reference. Copying each reference separately then costs more memory space and copying overhead than needed.
• Solution: the GC efficiently and incrementally determines, for each functional object, whether it is better to copy or to share it.
23
Forwarding pointers
• Conclusion 2: if a substantial fraction of the objects are functional, then the GC can be made more efficient.
• What about the non-functional objects?
– Permanent objects, or objects referenced from the ‘outside’, cannot become garbage. We can eliminate the copying and forwarding of these non-functional objects.
24
Forwarding pointers
• What about the temporary, shared, non-functional objects?
– If all the above optimizations were implemented, most of the cache misses would be on forwarding pointers.
– The cache now stores mainly forwarding pointers.
– Solution: maximize the density of the forwarding pointers in the cache.
– We can also use a separate table for the forwarding pointers.
25
Forwarding pointers - Pipeline
• The retrieval of forwarding pointers limits the traditional Cheney CGC.
• If the bottleneck is latency and not bandwidth, one can perform a number of forwarding pointer lookups in a pipelined fashion.
• Appel used this scheme in his vectorized CGC in 1989.
26
Parallel Garbage Collectors
27
Parallel Garbage Collection
• Problem: the CGC increases the traffic to memory and pushes mutator information out of the cache/main memory.
• If the CGC runs on another processor, those problems can be solved.
• Let’s focus on the real-time copying garbage collector (RTCGC) scheme that Baker suggested in 1978.
28
Parallel Garbage Collection
• The motivation is to let the mutator processor keep working while the collector processor is still copying objects from from-space to to-space.
• Let’s assume that the mutator is stopped during the flip (the switch of from-space with to-space), so that the cache can be invalidated.
29
Parallel Garbage Collection
• Problem: what if the mutator, while working in to-space, finds a pointer into from-space?
– It can move the object itself, accessing from-space.
– This leads to a miss in the mutator’s cache.
– As a result, information important to the mutator is evicted from the cache.
30
Parallel Garbage Collection
Questions
• If the mutator finds no forwarding pointer in the object, then it must copy it (or should the collector do it?).
• Should the mutator use an allocate pointer of its own, or gain access to the collector’s allocate pointer?
31
Parallel Garbage Collection
• What if the architecture supports a way to avoid disturbing the mutator’s cache while the mutator accesses from-space?
• How does the Baker RTCGC work?
– The mutator can work without synchronizing with the collector until it accesses from-space and does not find a forwarding pointer.
– Then the mutator tells the collector to copy the object, using a mailbox (containing the address of the desired object).
32
Parallel Garbage Collection
How does the Baker RTCGC work (cont)?
– The collector checks this mailbox in every cycle.
– The collector finds the object address in its mailbox.
– Then, the collector transports the object to to-space.
– Then, it installs the forwarding pointer.
– When the collector finishes, it sends a message to the mutator through a special mailbox, telling it that the collector is going to sleep.
– The mutator is responsible for waking the collector.
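The mailbox handshake described above can be sketched with two threads and two queues. This is a loose model under several assumptions: `queue.Queue` stands in for the hardware mailboxes, a blocking `get()` stands in for the collector sleeping until the mutator wakes it, forwarding pointers are kept in a dictionary, and the names `run_collector` and `mutator_read` plus the `None` shutdown signal are hypothetical.

```python
import queue
import threading


def run_collector(from_space, to_space, forwarding, requests, replies):
    """Collector loop: serve copy requests that the mutator posts."""
    while True:
        address = requests.get()            # sleep until the mailbox is filled
        if address is None:                 # hypothetical shutdown signal
            return
        if address not in forwarding:
            to_space.append(from_space[address])      # transport to to-space
            forwarding[address] = len(to_space) - 1   # install forwarding pointer
        replies.put(forwarding[address])    # tell the mutator the copy is done


def mutator_read(address, forwarding, requests, replies):
    """Mutator side: follow a forwarding pointer, or ask the collector to copy."""
    if address in forwarding:               # common case: no synchronization
        return forwarding[address]
    requests.put(address)                   # post the address; wakes the collector
    return replies.get()                    # wait for the collector's answer


from_space = ["obj-a", "obj-b", "obj-c"]
to_space, forwarding = [], {}
requests, replies = queue.Queue(), queue.Queue()
collector = threading.Thread(
    target=run_collector,
    args=(from_space, to_space, forwarding, requests, replies),
)
collector.start()

first = mutator_read(2, forwarding, requests, replies)   # copied on demand
again = mutator_read(2, forwarding, requests, replies)   # forwarding pointer hit
other = mutator_read(0, forwarding, requests, replies)

requests.put(None)
collector.join()
print(first, again, other, to_space)
```

The mutator never touches from-space contents itself in this sketch, which is the property that keeps its cache undisturbed.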
33
Case Study - The Intel 80860XP Architecture
Characteristics
• Modern 64-bit RISC processor
• 32-bit virtual address space mapped to a 32-bit physical address space
• Load/store instructions for quad-word blocks (which take more than 1 cycle)
• 16KB on-chip cache, which can be extended with an off-chip 256KB/512KB SRAM cache
• The cache is 4-way set-associative
34
Case Study - The Intel 80860XP Architecture
Characteristics
• Contains pipelined ‘Load’ instructions which bypass the data cache
• Contains ‘Store’ instructions which write to the cache on a cache hit and to memory otherwise. They don’t allocate new cache space
• Uses the MESI cache coherency protocol, based on a global bus and “snooping”
35
Case Study - The Intel 80860XP Architecture
Analysis
• The 64-bit bus and the use of quad-words in the load/store instructions provide substantial bandwidth to support a CGC.
• The cache-bypassing capability of the load instruction operates without damaging the cache locality of the mutator.
• The 4-way set associativity of the cache avoids the copy-thrashing behavior of a direct-mapped cache.
• The advanced cache coherency protocol should allow a dedicated GC processor to reduce the load on the mutator processor.
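The copy-thrashing point can be illustrated with a toy cache simulator. This is a sketch under loud assumptions: an 8-line cache with 4-word blocks (not the 80860XP's real geometry), LRU replacement, and reads and writes treated alike (unlike the 80860XP's stores, which do not allocate new cache space). The names `count_misses` and `copy_stream` are hypothetical. From-space and to-space are placed exactly one cache size apart, which is the worst case for a direct-mapped cache.

```python
from collections import OrderedDict


def count_misses(blocks, num_lines, ways):
    """Count misses for a stream of block addresses in an LRU set-associative cache."""
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in blocks:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)        # mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)   # evict the least recently used line
            lines[block] = True
    return misses


def copy_stream(from_base, to_base, num_words, words_per_block):
    """Block addresses touched by a word-by-word copy loop."""
    for w in range(num_words):
        yield (from_base + w) // words_per_block   # read from from-space
        yield (to_base + w) // words_per_block     # write to to-space


NUM_LINES, WORDS_PER_BLOCK = 8, 4                  # hypothetical toy geometry
to_base = NUM_LINES * WORDS_PER_BLOCK              # one cache size away
stream = list(copy_stream(0, to_base, 16, WORDS_PER_BLOCK))

direct_misses = count_misses(stream, NUM_LINES, ways=1)
four_way_misses = count_misses(stream, NUM_LINES, ways=4)
print(len(stream), direct_misses, four_way_misses)
```

In the direct-mapped case the source and destination blocks keep evicting each other, so every access misses. With 4 ways both blocks fit in the same set and only the first touch of each block misses.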
36
Summary
• We have examined the interaction of the CGC with the memory system of modern RISC architectures.
• We can easily overcome the drawbacks of copying collectors, especially with cache-bypassing mechanisms.
• Modern cache coherency protocols support parallel garbage collection rather nicely.
37
Summary - Cont
• We studied the Intel 80860XP architecture.
• We found it very suitable for copying garbage collection, in both single-processor and multi-processor configurations.
• We learned of methods like CDR-coding, uncopied large objects, economizing on forwarding pointers, pipelining, and parallel CGCs.
38
Questions…
39
The End