Improving Locality with Parallel Hierarchical Copying GC
David Siegwart, IBM Software Group
Martin Hirzel, IBM Watson Research Center
ISMM’06, Ottawa, Ontario, Canada, June 10th 2006
© 2006 IBM Corporation

Talk Summary
  – Motivation
  – Background & Related Work
  – Hierarchical Copying GC, Parallelized
  – Evaluation across a wide range of benchmarks
  – Conclusions

Motivation
Improving Locality:
  – Commercial workloads spend 45% of their time stalled on memory requests. [Adl-Tabatabai et al., PLDI’04: SPECjbb2000 on Itanium II]
  – Object order in memory influences cache misses.
  – Copying GC can relocate objects, changing object ordering.
  – Objective: co-locate objects that are used together, on the same page or cache line.
Maintaining Scalability:
  – Parallelism and workload balancing are essential for server workloads.

Related Objects are Used Together
Looked at consecutive field accesses:
  – siblings
  – child-parent
For SPECjbb2005:
  – 29% siblings
  – 14% child-parent
For a Trade6 primitive (J2EE benchmark):
  – 36% siblings
  – 8% child-parent
Copying GC should have:
  – good locality for siblings
  – good locality for child-parent

Background
[Diagram: lineage of copying collectors. Nodes: Cheney, Moon, Wilson/Lam/Moher, Halstead, Imai/Tick, Parallel Hierarchical. Edge labels: parallel, + load balancing, + hierarchical, – rescanning.]

Cheney Copying GC – Good for Siblings
Breadth-first scan.
[Diagram: tree of objects o1 to o15 copied into to-space in breadth-first order, with the scan and free pointers advancing. Legend: parent, child, free, copied, copied & scanned.]

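The breadth-first order on this slide can be reproduced with a small single-threaded sketch. It is an illustration only, not the J9 collector: objects are plain integers, a set stands in for forwarding pointers, and the 15-object binary tree (object i pointing to 2i and 2i+1) is an assumed shape for the o1 to o15 example.

```python
def cheney_copy_order(children, root):
    """Return objects in the order a Cheney-style breadth-first scan copies them."""
    to_space = [root]      # everything in this list has been copied; free == len(to_space)
    forwarded = {root}     # stands in for forwarding pointers (assumption of this sketch)
    scan = 0               # objects below `scan` are copied and scanned
    while scan < len(to_space):            # scan == free means nothing is left to scan
        for child in children.get(to_space[scan], []):
            if child not in forwarded:     # copy each object at most once
                forwarded.add(child)
                to_space.append(child)     # bump the free pointer
        scan += 1
    return to_space


# Hypothetical complete binary tree o1..o15: object i points to 2i and 2i+1.
tree = {i: [2 * i, 2 * i + 1] for i in range(1, 8)}
order = cheney_copy_order(tree, 1)
print(order)
```

Siblings (2i and 2i+1) always land next to each other, while the gap between a parent and its children grows with depth.
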
Cheney Copying GC – Bad for Parent-Child (SPECjbb2005)
[Plot: proportion (0% to 30%) against scanned-slot-to-copied-object distance (log2) for Cheney (breadth first), with the 64-byte cache line and the 4 kB page size marked.]
  – Increases working set, hence TLB misses and L2 cache misses.

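To make the plotted metric concrete, the sketch below buckets parent-to-child distances by floor(log2) for a breadth-first layout. It rests on several assumptions that are not from the talk: a uniform 16-byte object size, distances measured between object start addresses rather than from the scanned slot, and the same hypothetical 15-object binary tree.

```python
from collections import Counter

OBJECT_BYTES = 16  # assumed uniform object size, for illustration only


def distance_histogram(to_space, children):
    """Count parent-to-child distances, bucketed by floor(log2(distance in bytes))."""
    address = {obj: i * OBJECT_BYTES for i, obj in enumerate(to_space)}
    buckets = Counter()
    for parent, kids in children.items():
        for kid in kids:
            dist = abs(address[kid] - address[parent])
            buckets[dist.bit_length() - 1] += 1   # floor(log2(dist)) for dist >= 1
    return dict(sorted(buckets.items()))


# Hypothetical tree: object i points to 2i and 2i+1; breadth-first layout is o1..o15.
tree = {i: [2 * i, 2 * i + 1] for i in range(1, 8)}
breadth_first = list(range(1, 16))
hist = distance_histogram(breadth_first, tree)
print(hist)
```

Even in this tiny heap most parent-child pairs fall in the 64-byte-or-more buckets, which is the trend the slide reports at scale.
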
Depth-First Copying – Good for Parent-Child
[Diagram: tree of objects o1 to o15 laid out in depth-first order.]
  – Bad for siblings (o4, o5, o6, o7 are on separate pages).

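The sibling problem can be checked with a short sketch of a depth-first copy order. The tree shape and the "page" of three objects are assumptions chosen to match the 15-object, five-block pictures in this talk, not measured values.

```python
def depth_first_copy_order(children, root):
    """Return objects in pre-order, the order a depth-first copy would lay them out."""
    order, stack, seen = [], [root], {root}
    while stack:
        obj = stack.pop()
        order.append(obj)
        # Push children in reverse so the first child is copied first.
        for child in reversed(children.get(obj, [])):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return order


PAGE_OBJECTS = 3  # hypothetical page capacity, in objects

tree = {i: [2 * i, 2 * i + 1] for i in range(1, 8)}
order = depth_first_copy_order(tree, 1)
page_of = {obj: index // PAGE_OBJECTS for index, obj in enumerate(order)}
print(order)
print({obj: page_of[obj] for obj in (4, 5, 6, 7)})
```

Each parent is followed directly by its first child, but o4, o5, o6 and o7 end up on four different pages.
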
Moon’s Hierarchical Copying GC
Two scan pointers: scan, partial.
[Diagram: to-space blocks A to E holding o1 to o15. Animation steps show the scan, partial and free pointers, including the states partial = scan and partial = free; some objects are re-scanned.]

Wilson, Lam & Moher’s Hierarchical Copying GC
  – A scan pointer in each block avoids re-scanning.
  – Aliasing the scan block to the copy block reduces copy-scan distances.
[Diagram: to-space blocks A to E holding o1 to o15, with per-block scan pointers scanA to scanE and a free pointer. Animation steps alternate between scan block = copy block and scan block ≠ copy block.]

Imai and Tick’s Parallel Copying GC
[Diagram: to-space, a work pool, and threads 1 to n. Each thread’s scan block and copy block are either distinct (scan block ≠ copy block) or aliased (scan block = copy block).]

Recognising the Connection
  – Wilson, Lam & Moher (WLM): hierarchical, not parallel.
  – Imai & Tick: parallel, not hierarchical.
  – The immediacy of aliasing in WLM is what distinguishes it from Imai & Tick.
  – So immediate aliasing in Imai & Tick gives hierarchical copying.
  – Need to increase aliasing in Imai & Tick to improve locality.
[Diagram: work pool with threads 1 and 2; scan block ≠ copy block versus scan block = copy block (aliased).]

Immediate Aliasing
  – Check for an aliasing opportunity immediately after each reference slot in each object has been scanned.
  – Interrupt scanning at this point, and restart with the aliased block.
  – Easier to see via a transition diagram.

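The aliasing check described above can be sketched for a single thread as follows. This is a simplified illustration under stated assumptions, not the parallel J9 implementation: there is no work pool or locking, objects are unit-sized integers, a block holds three objects, and when the copy block has nothing left to scan the sketch falls back to the lowest-addressed block with unscanned objects. All names are hypothetical.

```python
def hierarchical_copy_order(children, root, block_size):
    """Single-threaded sketch: scan one reference slot, then re-check for aliasing."""
    to_space = [root]
    forwarded = {root}   # stands in for forwarding pointers
    scan_obj = {}        # per-block scan pointer: index of the next object to scan
    scan_slot = {}       # per-block: next reference slot within that object

    def has_work(block):
        start = block * block_size
        end = min(start + block_size, len(to_space))
        return scan_obj.get(block, start) < end

    while True:
        copy_block = len(to_space) // block_size   # block that receives the next copy
        if has_work(copy_block):
            block = copy_block                     # alias scan block to copy block
        else:
            block = next((b for b in range(copy_block) if has_work(b)), None)
            if block is None:
                break                              # every block is fully scanned
        pos = scan_obj.get(block, block * block_size)
        slot = scan_slot.get(block, 0)
        refs = children.get(to_space[pos], [])
        if slot < len(refs):
            child = refs[slot]
            if child not in forwarded:
                forwarded.add(child)
                to_space.append(child)
            scan_slot[block] = slot + 1            # resume here if interrupted
        else:
            scan_obj[block] = pos + 1              # object done, advance scan pointer
            scan_slot[block] = 0
    return to_space


tree = {i: [2 * i, 2 * i + 1] for i in range(1, 8)}
order = hierarchical_copy_order(tree, 1, block_size=3)
print(order)
```

Because the choice of block is re-evaluated after every slot, scanning of o2 is interrupted as soon as o4 opens a new copy block, so each block ends up holding a parent together with its two children.
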
Parallel Hierarchical – Block State Transitions
[Diagram: transitions between the block states freelist, copy, scan, aliased, scanlist and done, with shared data marked.]

Parent-Child Distances for Parallel Hierarchical (SPECjbb2005)
[Plot: proportion (0% to 30%) against scanned-slot-to-copied-object distance (log2) for Breadth-First and Hierarchical, with the 64-byte cache line and the 4 kB page size marked.]
  – Fewer TLB misses, fewer L2 cache misses.

Baseline GC
IBM J9 JVM; the GC has two generations.
Parallel copying for the young generation:
  – two semi-spaces
  – most GCs are of this type
Concurrent mark for the old generation:
  – stop-the-world phase (rare, compared to young collection)

Results – 26 Benchmark Suite
[Bar chart: % speedups (1 - PH/BF) per benchmark; heap size 10x min, except SPECjbb2005. Benchmarks in chart order: SPECjbb2005, db, javasrc, mtrt, jbytemark, javac, chart, jpat, banshee, javalex, jython, eclipse, mpegaudio, compress, fop, hsqldb, kawa, soot, batik, jack, antlr, jess, ps, bloat, pmd, ipsixql.]

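The chart's metric, 1 - PH/BF, can be written out as a one-line helper. The sketch assumes PH and BF are run times for the Parallel Hierarchical and Breadth-First collectors (the slide does not spell this out), and the numbers used are made up for illustration.

```python
def speedup(ph_time, bf_time):
    """Fractional speedup of PH over BF: 1 - PH/BF (positive means PH is faster)."""
    if bf_time <= 0:
        raise ValueError("bf_time must be positive")
    return 1.0 - ph_time / bf_time


# Hypothetical timings in seconds, not measurements from the talk.
print(f"{speedup(90.0, 100.0):.1%}")
```

A negative result means the hierarchical run was slower than the breadth-first baseline.
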
Results – Scalability (SPECjbb2005)
Windows 2000 Advanced Server SP4
4x (1.6 GHz HT Pentium 4 Xeon), 256 kB L2 (64-byte cache line), 1 MB L3, 2 GB RAM
Base build: J9 5.0 GA pwi32dev
[Plot: throughput against warehouses for Hierarchical and Breadth-First.]

GC Scaling – SPECjbb2005
Windows 2000 Advanced Server SP4
4x (1.6 GHz HT Pentium 4 Xeon), 256 kB L2 (64-byte cache line), 1 MB L3, 2 GB RAM
Base build: J9 5.0 GA pwi32dev

Mutator vs Collector – db
Linux
1x (3.06 GHz HT Pentium 4 Xeon), 512 kB L2 (64-byte cache line), 1 GB RAM
Base build: J9 5.0 GA pxi32dev
[Plot, Mutator Time: normalized mutator time against heap size relative to minimum heap size, for Hierarchical and Breadth-First.]
[Plot, GC Time: normalized GC time against heap size relative to minimum heap size, for Hierarchical and Breadth-First.]

Cache & TLB Misses – db
Linux
1x (3.06 GHz HT Pentium 4 Xeon), 512 kB L2 (64-byte cache line), 1 GB RAM
Base build: J9 5.0 GA pxi32dev
[Plot: normalized mutator L1 cache misses against heap size relative to minimum heap size, for Hierarchical and Breadth-First.]
[Plot: normalized mutator TLB misses against heap size relative to minimum heap size, for Hierarchical and Breadth-First.]

Conclusions
Introduced a new algorithm:
  – improves memory locality
  – maintains good scalability
Two technologies in one: hierarchical decomposition and parallel copying GC.
Requires no online profiling.
Evaluated across a wide range of benchmarks:
  – better locality: a dramatic reduction in TLB misses, and fewer L1 misses as well
  – cost on the collector is outweighed by the benefit to the mutator
  – the majority of benchmarks show improvements

Backup

Related Work
[Figure: prior work placed on a grid. Labels in the figure: memory-hierarchy levels L1, L2, TLB, Paging; languages C/C++, Java, Lisp, …; techniques OS, Allocator, Prefetching, Moving GC. Works shown: Ch./La. ’98, Huang ’04, Shuf ’02, Adl-T. ’04, Lattner ’04, La./Ad. ’05, Ch./Hi. ’01, Cascaval ’05, Moon ’84, Kistler/Fra. ’03, Wi/La/Mo. ’91.]

Results – 26 Benchmark Suite – other heap sizes