
2 Reducing Garbage Collector Cache Misses Shachar Rubinstein Garbage Collection Seminar

3 The End!

4 The general problem CPUs are getting faster and faster Main memory speed lags behind Result: the cost of accessing main memory is increasing

5 Solutions Hardware and software techniques: – Memory hierarchy – Prefetching – Multithreading – Non-blocking caches – Dynamic instruction scheduling – Speculative execution

6 Great Solutions? Complex hardware and compilers Ineffective for many programs They attack the manifestation (memory latency), not the source (poor reference locality) Not exactly…

7 Previous work Improving cache locality in dense matrices using loop transformations Other profile-driven, compiler-directed approaches

8 The GC problem Little temporal locality. Each live object is usually read only once during the mark phase. Most reads are likely to miss. The newly fetched cache contents are unlikely to be used more than once.

9 The GC problem – cont. The sweep phase, like the mark phase, also touches each object once That is because the free-list pointers are maintained in the objects themselves Unlike the mark phase, the sweep phase is more sequential

10 The GC problem – cont. The sweep is less likely to use cache contents left by the marker The allocator is likely to miss again, when the object is allocated

11 The GC problem – previous work Older work concentrated on paging performance. Growing memory sizes led to abandoning this goal. But larger memories also brought huge cache miss penalties. The largest cache size < heap size This problem is unavoidable.

12 Previous work Reducing sweep time for a nearly empty heap Compiler-based prefetching for recursive data structures

13 How am I going to improve the situation? Do some magic! Well, no… Use real-time information to improve program cache locality The mark and sweep phases offer invaluable opportunities for improvement – Bring objects into the cache earlier – Reuse freed objects for reallocation

14 Some numbers Relative to a copying GC – Cache miss rates reduced by 21-42% – Program performance improved by 14-37% Relative to a page-level GC – Cache miss rates reduced by 20-41% – Program performance improved by 18-31%

15 Road map Cache conscious data placement using generational GC – Overview – Short generational GC reminder – Real-time data profiling – Object affinity graph – Combining the affinity graph with GC – Experimental evaluation Other methods and their experimental results

16 Overview A program is instrumented to profile its access patterns The data is used in the same execution, not a later one The data is turned into an object affinity graph A new copy algorithm uses the graph to lay out the data while copying

17 Generational GC – a reminder The heap is divided into generations GC activity concentrates on young objects, which typically die sooner Objects that survive one or more scavenges are moved to the next generation

18 Implementation notes The authors used the UM GC toolkit The toolkit supports several steps per generation The authors used a single step per generation for simplicity Each step consists of fixed-size blocks The blocks are not necessarily contiguous in memory

19 Implementation notes - steps

20 The steps are used to encode the objects’ age An object which survives a scavenge is moved to the next step
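To make the generation/step/block structure concrete, here is one possible layout sketched in C. It assumes the simplified single-step-per-generation setup described above; the field names, block size, and linkage are illustrative assumptions, not the UM toolkit's actual definitions.

```c
/* Illustrative sketch only: each step is a linked list of fixed-size
 * blocks that need not be contiguous in memory. */
#define BLOCK_SIZE (64 * 1024)          /* assumed block size */

struct block {
    struct block *next;                 /* blocks are chained, not contiguous */
    char data[BLOCK_SIZE];              /* objects belonging to this step */
};

struct step {                           /* a step encodes the age of its objects */
    struct block *blocks;
};

struct generation {
    struct step step;                   /* a single step per generation, for simplicity */
    struct generation *next_older;      /* survivors move toward older generations */
};
```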

21 Implementation notes – moving between generations The scavenger collects a generation g and all younger generations It starts from objects that are: – in g – and reachable from the roots Moving an object means copying it into the TO space The FROM space can then be reused

22 Copying algorithm – a reminder Cheney’s algorithm TO and FROM spaces are switched Starts from the root set Objects are traversed breadth-first using a queue Objects are copied to TO space Terminates when the queue is empty
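As a refresher, below is a minimal sketch of Cheney's algorithm over a toy heap. The object layout, space size, and root handling are simplifying assumptions for illustration, not the collector's actual code; the TO space itself serves as the breadth-first queue (the "queue trick" on the next slide).

```c
/* Minimal sketch of Cheney's copying algorithm over a toy heap. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SPACE_OBJS 1024
#define NFIELDS 2

typedef struct Obj {
    struct Obj *forward;          /* set once the object has been copied */
    struct Obj *fields[NFIELDS];  /* pointer fields (NULL = no pointer) */
} Obj;

static Obj to_space[SPACE_OBJS];
static size_t free_idx;           /* "free" pointer: next empty slot in TO space */

static Obj *copy(Obj *o)
{
    if (o == NULL) return NULL;
    if (o->forward != NULL) return o->forward;   /* already copied: follow forwarding address */
    assert(free_idx < SPACE_OBJS);
    Obj *dst = &to_space[free_idx++];
    memcpy(dst, o, sizeof *dst);
    dst->forward = NULL;
    o->forward = dst;                            /* leave forwarding address in FROM space */
    return dst;
}

void cheney_collect(Obj **roots, size_t nroots)
{
    size_t scan = 0;                             /* "unprocessed" pointer */
    free_idx = 0;
    for (size_t i = 0; i < nroots; i++)          /* start from the root set */
        roots[i] = copy(roots[i]);
    while (scan < free_idx) {                    /* breadth-first: TO space is the queue */
        Obj *o = &to_space[scan++];
        for (int f = 0; f < NFIELDS; f++)
            o->fields[f] = copy(o->fields[f]);
    }
    /* terminates: everything reachable is in TO space; FROM space can be reused */
}
```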

23 Copying algorithm – the queue trick

24 The algorithm

25 Did you get it?

26 Real-time data profiling A profile from an earlier program run is not good enough Real-time data eliminates: – A separate profiling run – Finding representative inputs But the overhead must be low! Great!

27 Profiling data access patterns Naive approach: trace every load and store to the heap Huge overhead (a factor of 10!)

28 Reducing overhead Exploit properties of object-oriented programs 1. Most objects are small, often less than 32 bytes – No need to distinguish between fields, since cache blocks are bigger

29 Reducing overhead – cont. 2. Most object accesses are not lightweight – so profiling instrumentation does not add a large relative overhead Don’t believe it? Stay awake

30 Collecting profiling data Only “load”s of base object addresses are recorded Uses a modified compiler The compiler retains object type information to instrument loads selectively

31 Code instrumentation

32 Collecting profiling data – cont. The base object address is entered into the object access buffer

33 Implementation note Uses a page trap to detect buffer overflow A trap causes the graph to be built Recommended buffer size: 15,000 entries (60KB)
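A rough sketch of the inserted instrumentation might look like the following. The names record_access and build_affinity_graph are hypothetical, and the explicit bounds check stands in for the page-trap mechanism the authors actually use; the 60KB figure assumes 4-byte pointers.

```c
/* Sketch of the profiling instrumentation the modified compiler emits. */
#include <stddef.h>

#define BUFFER_ENTRIES 15000                  /* recommended: 15,000 addresses (~60KB) */

static void *object_access_buffer[BUFFER_ENTRIES];
static size_t buffer_pos;

/* Stub: builds/updates the object affinity graph from the buffer contents. */
static void build_affinity_graph(void **buf, size_t n) { (void)buf; (void)n; }

/* A call like this is inserted after each selected load of a base object address. */
void record_access(void *base_address)
{
    object_access_buffer[buffer_pos++] = base_address;
    if (buffer_pos == BUFFER_ENTRIES) {       /* the real system detects overflow with a page trap */
        build_affinity_graph(object_access_buffer, buffer_pos);
        buffer_pos = 0;
    }
}
```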

34 Affinity? Main Entry: af·fin·i·ty Pronunciation: &-'fi-n&-tE Function: noun Inflected Form(s): plural -ties Etymology: Middle English affinite, from Middle French or Latin; Middle French afinité, from Latin affinitas, from affinis bordering on, related by marriage, from ad- + finis end, border Date: 14th century 1 : relationship by marriage 2 a : sympathy marked by community of interest : KINSHIP b (1) : an attraction to or liking for something (2) : an attractive force between substances or particles that causes them to enter into and remain in chemical combination c : a person especially of the opposite sex having a particular attraction for one 3 a : likeness based on relationship or causal connection b : a relation between biological groups involving resemblance in structural plan and indicating a common origin KINSHIP

35 The object affinity graph

36 Nodes – objects Edges – Temporal affinity between objects An undirected graph

37 Building the graph

38 Inserting an object into the queue

39 Incrementing edges’ weight

40 All clear?

41–50 Demonstration Slides 41–50 step through the object access buffer, which holds the sequence A, B, A, C, D, D, A. Each entry in turn is pushed onto the locality queue (newest at the tail), and the weights of the graph edges between it and the objects already in the queue are incremented, building up the object affinity graph entry by entry.

51 Implementation notes A separate affinity graph is built for each generation, except the first. This uses the fact that an object's generation is encoded in its address. This method prevents placing objects from different generations in the same cache block. (Explained later)

52 Implementation notes – queue size The locality queue size is important Too small -> temporal relationships are missed Too big -> a huge graph and long processing time Recommended size: 3
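Below is a small, self-contained sketch of the graph-building step with a locality queue of size 3, run on the access sequence from the demonstration (A B A C D D A). The adjacency-matrix representation, integer object ids, and the handling of repeated accesses to a queued object are simplifying assumptions, so the resulting weights may differ from the paper's figures.

```c
/* Toy sketch: build an object affinity graph from an access buffer
 * using a FIFO locality queue of size 3. */
#include <stdio.h>

#define QUEUE_SIZE 3          /* recommended locality queue size */
#define MAX_OBJS   4          /* toy limit: objects A..D */

static int weight[MAX_OBJS][MAX_OBJS];   /* undirected edge weights */
static int queue[QUEUE_SIZE];            /* locality queue, oldest first */
static int qlen;

/* Process one object-access-buffer entry. */
static void add_access(int obj)
{
    for (int i = 0; i < qlen; i++)        /* temporal affinity with queued objects */
        if (queue[i] != obj) {
            weight[queue[i]][obj]++;
            weight[obj][queue[i]]++;
        }
    if (qlen == QUEUE_SIZE) {             /* queue full: evict the oldest entry */
        for (int i = 1; i < QUEUE_SIZE; i++)
            queue[i - 1] = queue[i];
        qlen--;
    }
    queue[qlen++] = obj;                  /* newest entry at the tail */
}

int main(void)
{
    int buffer[] = {0, 1, 0, 2, 3, 3, 0};               /* A=0, B=1, C=2, D=3 */
    int n = (int)(sizeof buffer / sizeof buffer[0]);
    for (int i = 0; i < n; i++)
        add_access(buffer[i]);
    for (int a = 0; a < MAX_OBJS; a++)                   /* print the resulting edges */
        for (int b = a + 1; b < MAX_OBJS; b++)
            if (weight[a][b] > 0)
                printf("%c-%c: %d\n", 'A' + a, 'A' + b, weight[a][b]);
    return 0;
}
```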

53 Implementation notes Re-create or update the graph? Depends on the application – Applications with distinct access phases should re-create – Applications with uniform behavior should update In this article – the graph is re-created before each scavenge

54 Stop! Our goal: produce a cache-conscious data layout, so that objects accessed together are likely to reside in the same cache block In English: place objects with high temporal affinity next to each other The method: use the profiling information we’ve collected during the copying process

55 GC + Real-time profiling Use the object affinity graph in the Copying algorithm.

56 Example – object affinity graph

57 Example – before step 1

58 Step 1 – using the graph Flip the TO and FROM roles Initialize the free and unprocessed pointers to the beginning of the TO space Pick a node that is: – in the root set – and – in the affinity graph, with the highest edge weight Perform a greedy DFS on the graph

59 Step 1 – cont. Copy each visited object to the TO space Increment the free pointer Store a forwarding address in the FROM space

60 Example – After step 1

61 Step 2 – continues Cheney’s way Process all objects between the unprocessed and the free pointers, as in Cheney’s algorithm

62 Example – After step 2

63 Step 3 - cleanup Ensure all roots are in the TO space If not, process them using Cheney’s algorithm

64 Example – After step 3

65 Implementation notes The object access buffer can be used as a stack for the DFS
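The following toy sketch shows the flavor of step 1: a greedy depth-first traversal of the affinity graph that always follows the heaviest unvisited edge and "copies" objects in visit order, so high-affinity objects become neighbours in the TO space. The graph, its weights, and the recursive formulation are illustrative assumptions (the real collector copies actual objects and, as noted above, can reuse the object access buffer as an explicit DFS stack); steps 2 and 3 are omitted.

```c
/* Toy sketch of step 1: greedy DFS over the affinity graph. */
#include <stdio.h>

#define N 4                               /* toy objects A..D */

static int weight[N][N];                  /* affinity edge weights (0 = no edge) */
static int visited[N];
static int copy_order[N], copied;         /* resulting layout order in TO space */

static void greedy_dfs(int o)
{
    visited[o] = 1;
    copy_order[copied++] = o;             /* "copy" o next to its predecessor */
    for (;;) {
        int best = -1;                    /* heaviest not-yet-visited neighbour of o */
        for (int n = 0; n < N; n++)
            if (!visited[n] && weight[o][n] > 0 &&
                (best < 0 || weight[o][n] > weight[o][best]))
                best = n;
        if (best < 0) break;
        greedy_dfs(best);
    }
}

int main(void)
{
    /* assumed affinity graph: A-B:2, A-C:2, A-D:3, B-C:1 */
    weight[0][1] = weight[1][0] = 2;
    weight[0][2] = weight[2][0] = 2;
    weight[0][3] = weight[3][0] = 3;
    weight[1][2] = weight[2][1] = 1;

    greedy_dfs(0);                        /* 0 = A: a root node with the heaviest edge */
    for (int i = 0; i < copied; i++)
        printf("%c ", 'A' + copy_order[i]);   /* e.g. "A D B C" */
    printf("\n");
    return 0;
}
```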

66 Inaccurate results(?) The object affinity graph may retain objects that are no longer reachable (= garbage) They will be incorrectly promoted at most once Effort is focused on longer-lived objects and not on the youngest generation

67 Experimental evaluation Methodology – if we have the time Object-oriented programs manipulate small objects Real-time data profiling overhead The algorithm's impact on performance

68 Size of heap objects

69 But that’s not the point! Small objects often die fast

70 Surviving heap objects

71 Real-time data profiling overhead

72 Overall execution time

73 Overall execution time – notes No impact on the L1 cache, because its 16B blocks are too small to hold more than one co-located object

74 Compared to WLM algorithm

75 Comparison notes WLM (Wilson-Lam-Moher) improves a program’s virtual memory locality It performed worse than or close to Cheney’s because the 2GB of memory prevented paging, leaving no virtual memory locality to improve

76 What else?

77 Other methods Two methods that can be used with the previous one – Prefetch on grey – Lazy sweeping

78 Assumptions A non-moving mark-sweep collector For simplicity, the collector segregates objects by size. Each block contains objects of a single size The collector’s data structures are kept outside the user-visible heap A mark bit is reserved for each word in a block

79 Advantages of “outside the heap” data The mark phase does not need to examine (= bring into the cache) pointer-free objects Sequences of small unreachable objects can be reclaimed as a group – A single instruction can examine their sequence of mark bits – This is used when a heap block turns out to be empty
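As a small illustration of that last point, here is a hedged sketch: with mark bits stored in words outside the heap, an entire block of small objects can be recognized as empty, and reclaimed as a group, with one comparison per mark-bit word and no access to the block's own memory. The sizes and header layout are assumptions made for illustration.

```c
/* Sketch: detect a completely empty block from its out-of-heap mark bits. */
#include <stdint.h>
#include <stdbool.h>

#define WORDS_PER_BLOCK 1024                      /* e.g. a 4KB block of 4-byte words */
#define MARK_WORDS (WORDS_PER_BLOCK / 32)         /* one mark bit per heap word */

struct block_header {                             /* kept outside the user-visible heap */
    uint32_t mark_bits[MARK_WORDS];
};

/* True if no object in the block is marked: the block can be returned
 * to the free pool as a group, without ever touching the block itself. */
bool block_is_empty(const struct block_header *h)
{
    for (int i = 0; i < MARK_WORDS; i++)          /* one test covers 32 words' mark bits */
        if (h->mark_bits[i] != 0)
            return false;
    return true;
}
```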

80 The mark phase – a reminder Ensure that all objects are white. Grey all objects pointed to by a root. while there is a grey object g – blacken g – For each pointer p in g if p points to a white object – grey that object.

81 The mark phase – colors 1 mark bit – 0 is white – 1 is grey/black Stack – In the stack – grey – Removed from stack - black

82 The mark GC problem A significant fraction of time is spent retrieving the first pointer p from each grey object About a third of the marker’s execution time is spent this way This fraction is expected to increase on future machines

83 Prefetching A modern CPU instruction A program can prefetch data into the cache for future use

84 Prefetching – cont. But the object reference must be predicted early enough For example, if the object is in main memory, it must be prefetched hundreds of cycles before its use Prefetch instructions are mostly inserted by compiler optimizations

85 Prefetch on grey When? Prefetch as soon as p is found likely to be a pointer What? Prefetch the first cache line of the object

86 To improve performance The last pointer to be pushed on the mark stack is prefetched first This minimizes the cases in which a just-greyed object is immediately examined

87 And to improve more Prefetch a few cache lines ahead when scanning an object This helps with large objects If the object isn't that large, it prefetches the following objects instead
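A hedged sketch of what prefetch-on-grey can look like in a mark loop is shown below. The object layout, the in-object mark flag, and the fixed-size mark stack are simplifications (the real collector keeps mark bits outside the heap); __builtin_prefetch is the GCC/Clang prefetch intrinsic. The two refinements above (prefetching the last-pushed pointer first, and prefetching a few lines ahead while scanning) are not implemented here.

```c
/* Sketch of a mark loop with "prefetch on grey". */
#include <stddef.h>

#define NFIELDS   4
#define STACK_MAX 4096

typedef struct Obj {
    int marked;                           /* 0 = white, 1 = grey/black */
    struct Obj *fields[NFIELDS];          /* pointer fields (NULL = no pointer) */
} Obj;

static Obj *mark_stack[STACK_MAX];        /* overflow handling omitted */
static size_t top;

static void push_grey(Obj *o)
{
    if (o == NULL || o->marked) return;   /* only white objects are greyed */
    o->marked = 1;                        /* grey: marked and on the stack */
    __builtin_prefetch(o, 0, 3);          /* prefetch its first cache line right away */
    mark_stack[top++] = o;
}

void mark(Obj **roots, size_t nroots)
{
    for (size_t i = 0; i < nroots; i++)   /* grey all objects pointed to by a root */
        push_grey(roots[i]);
    while (top > 0) {
        Obj *g = mark_stack[--top];       /* pop and blacken g */
        for (int f = 0; f < NFIELDS; f++) /* grey (and prefetch) g's white children */
            push_grey(g->fields[f]);
    }
}
```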

88 The sweep GC problem If (reclaimed memory > cache size) – Objects are likely to be evicted from the cache by the allocator or mutator Thus, the allocator will miss again when reusing the reclaimed memory

89 Lazy sweeping Originally used to reduce page faults Sweeping is delayed until the allocator needs the memory Memory is then reused instead of being evicted from the cache first

90 A reminder A mark bit is saved for each word in a cache block. A mark bit is used only if its word is the beginning of an object

91 Cache lazy sweeping – the collector For each block, the collector scans the block's mark bits If all bits are unmarked, the block is added to the free-block pool without touching it If some bits are marked, the block is added to a queue of blocks waiting to be swept There are several queues, one or more for each object size

92 Cache lazy sweeping – the allocator Maps the request to the appropriate object free list Returns the first object from the list If the list is empty – It sweeps the queue of the right size for a block with some available objects
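Here is a simplified, self-contained sketch of the collector and allocator sides of cache-lazy sweeping. The size-class scheme, one mark flag per 16-byte slot, and a single sweep queue per size class are assumptions made for brevity; the real collector keeps one mark bit per word and may keep more than one queue per object size.

```c
/* Sketch of cache-lazy sweeping: empty blocks are reclaimed whole,
 * partly live blocks are swept only when the allocator needs them. */
#include <stddef.h>

#define BLOCK_BYTES   4096
#define NSIZE_CLASSES 8                        /* object sizes 16, 32, ..., 128 bytes */

struct block {
    struct block *next;
    size_t obj_size;                           /* every object in a block has one size */
    unsigned char mark[BLOCK_BYTES / 16];      /* one mark flag per 16-byte slot (simplified) */
    unsigned char data[BLOCK_BYTES];
};

struct free_obj { struct free_obj *next; };

static struct free_obj *free_list[NSIZE_CLASSES];   /* per-size free lists */
static struct block *sweep_queue[NSIZE_CLASSES];    /* blocks waiting to be swept */
static struct block *free_blocks;                   /* completely empty blocks */

static size_t size_class(size_t bytes) { return (bytes - 1) / 16; }

/* Collector side: after marking, classify each block without touching its data. */
void retire_block(struct block *b)
{
    size_t nobjs = BLOCK_BYTES / b->obj_size, marked = 0;
    for (size_t i = 0; i < nobjs; i++)
        marked += b->mark[i];
    if (marked == 0) {                         /* empty: reclaim as a whole, untouched */
        b->next = free_blocks;
        free_blocks = b;
    } else {                                   /* partly live: queue it for lazy sweeping */
        size_t c = size_class(b->obj_size);
        b->next = sweep_queue[c];
        sweep_queue[c] = b;
    }
}

/* Allocator side: sweep a queued block only when its size class runs dry. */
static void sweep_one_block(size_t c)
{
    struct block *b = sweep_queue[c];
    if (b == NULL) return;
    sweep_queue[c] = b->next;
    size_t nobjs = BLOCK_BYTES / b->obj_size;
    for (size_t i = 0; i < nobjs; i++) {
        if (!b->mark[i]) {                     /* unmarked object: thread onto the free list */
            struct free_obj *o = (struct free_obj *)&b->data[i * b->obj_size];
            o->next = free_list[c];
            free_list[c] = o;
        }
        b->mark[i] = 0;                        /* reset for the next collection */
    }
}

void *allocate(size_t bytes)                   /* assumes bytes <= 128 */
{
    size_t c = size_class(bytes);
    while (free_list[c] == NULL) {
        if (sweep_queue[c] == NULL) return NULL;   /* would trigger a collection */
        sweep_one_block(c);
    }
    struct free_obj *o = free_list[c];
    free_list[c] = o->next;
    return o;
}
```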

93 Experimental results Measured on two platforms The second platform provides some calibration of architectural variation

94 Pentium III/500 results

95 HP PA-8000/180 based results

96 Results conclusions Prefetch on grey eliminates from a third to almost all of the marker's cache miss overhead But the benefit depends on the data structures used by the program

97 Results conclusions – cont. Collector performance is determined by the marker The sweep performance is architecture dependent

98 Conclusions Be concerned about cache locality or Have a method that does it for you

99 Conclusions – cont. Real-time data profiling is feasible It can be used to produce a cache-conscious data layout It may help reduce the performance gap between high-level and low-level languages

100 Conclusions – cont. Prefetch on grey and lazy sweeping are cheap to implement and should be in future garbage collectors

101 Bibliography Using Generational Garbage Collection To Implement Cache-Conscious Data Placement - Trishul M. Chilimbi and James R. Larus Reducing Garbage Collector Cache Misses - Hans-J. Boehm

102 Further reading Look at the articles Garbage Collection: Algorithms for Automatic Dynamic Memory Management – Richard Jones & Rafael Lins

103 Further reading – cont. Cecil: – Craig Chambers. “Object-oriented multi-methods in Cecil.” In Proceedings ECOOP’92, LNCS 615, Springer-Verlag, pages 33–56, June 1992. – Craig Chambers. “The Cecil language: Specification and rationale.” University of Washington Seattle, Technical Report TR-93-03-05, Mar. 1993. Hyperion by Dan Simmons

104

105 Items Large objects Inter-generational object placement Why explicitly build free lists? Experimental methodology Second experimental methodology

106 Large objects Ungar and Jackson: – There is an advantage in not copying large objects (>= 256 bytes) of the same age A large object is never copied Each step has an associated set of large objects

107 Large objects – cont. A large object is linked into a doubly linked list If it survives a collection, it is removed from its list and inserted into the TO space list No compaction is done on large objects
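A small sketch of this bookkeeping is shown below: a surviving large object is simply unlinked from its old step's list and linked into the TO step's list, with no copying. The header layout and list structure are illustrative assumptions.

```c
/* Sketch: large objects (>= 256 bytes) are never copied, only relinked. */
#include <stddef.h>

struct large_obj {
    struct large_obj *prev, *next;            /* doubly linked, for O(1) removal */
    /* ... at least 256 bytes of payload ... */
};

struct large_list { struct large_obj *head; };

void promote_large(struct large_obj *o, struct large_list *from, struct large_list *to)
{
    if (o->prev) o->prev->next = o->next;     /* unlink from the FROM step's list */
    else from->head = o->next;
    if (o->next) o->next->prev = o->prev;

    o->prev = NULL;                           /* link at the head of the TO step's list */
    o->next = to->head;
    if (to->head) to->head->prev = o;
    to->head = o;
}
```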

108 Large objects – cont. Read more in David Ungar and Frank Jackson. “An adaptive tenuring policy for generation scavengers.” ACM Transactions on Programming Languages and Systems, 14(1):1–27, January 1992

109 Two generations, one cache block How important is co-location of inter-generation objects? The way to achieve this is to demote or promote objects.

110 Two generations, one cache block – cont. Intra-generation pointers are not tracked To demote an object safely, its original generation must be collected Result: long collection times

111 Two generations, one cache block – cont. Promotion can be done safely – The young generation is being collected and its pointers are updated – Pointers from old to young are tracked But the locality benefit starts only when the old generation is collected It also risks premature promotion

112 Why explicitly build free lists? Allocation is fast Heap scanning for unmarked objects can be fast using mark bits Little additional space overhead is required

113 Experimental methodology The Vortex compiler infrastructure Vortex supports generational GC only for Cecil Cecil – a dynamically typed, purely object-oriented language Used the Cecil benchmarks Repeated each experiment 5 times and reported the average

114 Cecil benchmarks

115 Cecil benchmarks – cont. Compiled at highest (o2) optimization level

116 The platform Sun Ultraserver E5000 12 167MHz UltraSPARC processors 2GB memory – to prevent page faults Solaris 2.5.1

117 The platform – memory L1 – 16KB, direct-mapped, 16B blocks L2 – 1MB unified, direct-mapped, 64B blocks 64-entry iTLB and 64-entry dTLB, fully associative

118 The platform – memory costs L1, data cache hit – 1 cycle L1 miss, L2 hit – 6 cycles L2 miss – additional 64 cycles

119 Second experimental methodology Two platforms All benchmarks except one are C programs

120 Pentium measurements Dual-processor 500MHz Pentium III (but only one processor used) 100MHz bus 512KB L2 cache Physical memory > 300MB (why keep it a secret?), which prevented paging and kept the whole executable in memory RedHat 6.1 Benchmarks compiled using gcc with -O2

121 RISC measurements A single PA-8000/180MHz processor Running HP-UX 11 Single-level I and D caches, 1MB each

122 Benchmarks Execution time measurements are an average of five runs The division between sweep and mark times is somewhat arbitrary The Pentium III prefetcht0 instruction introduced extra overhead, so prefetchnta was used instead It was less effective at eliminating cache misses, though

123 ?

124 The end Lectured by: Shachar Rubinstein shachar1@post.tau.ac.il GC seminar, Mooly Sagiv Audience: You Thanks: For your patience The PowerPoint XP effects My parents No animals were harmed during this production (except for annoying mosquitoes) Thank you for listening! (and staying awake…)

