
1 Improving Locality with Parallel Hierarchical Copying GC
David Siegwart, IBM Software Group
Martin Hirzel, IBM Watson Research Center
ISMM’06, Ottawa, Ontario, Canada, June 10th 2006
© 2005, 2006 IBM Corporation

2 Talk Summary
- Motivation
- Background & Related Work
- Hierarchical Copying GC, Parallelized
- Evaluation across a wide range of benchmarks
- Conclusions

3 Motivation
- Improving locality:
  - Commercial workloads spend 45% of their time stalled on memory requests. [Adl-Tabatabai et al., PLDI’04: SPECjbb2000 on Itanium II]
  - Object order in memory influences cache and TLB misses.
  - Copying GC can relocate objects, changing object ordering.
  - Objective: co-locate objects that are used together on the same page or cache line.
- Maintaining scalability:
  - Parallelism and workload balancing are essential for server workloads.

4 Related Objects are Used Together
- Looked at consecutive field accesses:
  - siblings
  - child-parent
- For SPECjbb2005:
  - 29% siblings
  - 14% child-parent
- For a Trade6 primitive (J2EE benchmark):
  - 36% siblings
  - 8% child-parent
- Copying GC should have:
  - good locality for siblings
  - good locality for child-parent

5 Background
[Figure: timeline of copying GC algorithms. Nodes: Cheney, Moon, Wilson/Lam/Moher, Halstead, Imai/Tick, Parallel Hierarchical. Years shown: 1970, 1984, 1985, 1992, 1993, 2006. Edge labels: + parallel, + load balancing, + hierarchical, - rescanning.]

6 Cheney Copying GC – Good for Siblings
[Figure: breadth-first scan of a tree of objects o1 to o15 into to-space, stepping the scan and free pointers. Legend: parent, child, free, copied, copied & scanned.]
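The breadth-first behaviour on this slide can be illustrated with a minimal sketch. It models to-space as a Python list and assumes a hypothetical heap in which objects o1 to o15 form a complete binary tree (object i points to 2i and 2i+1); the exact tree shape on the slide is not recoverable from the transcript, and `cheney_copy_order` and `TREE` are names made up for this illustration.

```python
def cheney_copy_order(root, children):
    """Return object ids in the order a Cheney-style scan copies them."""
    to_space = [root]      # copying the root advances the free pointer
    forwarded = {root}     # objects that already have a to-space copy
    scan = 0               # scan pointer: index of the next object to scan
    while scan < len(to_space):   # scan catching up with free ends the collection
        for child in children.get(to_space[scan], ()):
            if child not in forwarded:
                forwarded.add(child)
                to_space.append(child)   # copy at free, then bump free
        scan += 1
    return to_space


# Hypothetical heap (an assumption): complete binary tree of 15 objects.
TREE = {i: [2 * i, 2 * i + 1] for i in range(1, 8)}
order = cheney_copy_order(1, TREE)
print(order)
```

On this assumed tree the copy order is simply o1, o2, ..., o15: siblings end up adjacent, while the gap between a parent and its children grows with every level.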

7 Cheney Copying GC – Bad for Parent-Child (SPECjbb2005)
[Figure: histogram for Cheney (breadth first). X axis: scanned slot to copied object distance (log2), 1 to 26. Y axis: proportion, 0% to 30%. Markers: 64 byte cache line, page size (4 kB).]
- Increases working set, hence TLB misses and L2 cache misses
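A histogram of this kind can be computed along the following lines. This is a sketch under assumptions: the slide does not say how distances are bucketed, so floor(log2) of the byte distance is used here, and the sample distances are invented purely for illustration.

```python
from collections import Counter


def log2_bucket(distance_bytes):
    """Bucket a positive byte distance by floor(log2(distance))."""
    if distance_bytes <= 0:
        raise ValueError("distance must be positive")
    return distance_bytes.bit_length() - 1


def distance_histogram(distances):
    """Map each log2 bucket to the proportion of distances that fall in it."""
    counts = Counter(log2_bucket(d) for d in distances)
    total = sum(counts.values())
    return {bucket: counts[bucket] / total for bucket in sorted(counts)}


# Invented sample distances in bytes (not measured data).
SAMPLE = [16, 48, 64, 100, 4096, 5000, 1 << 20]
hist = distance_histogram(SAMPLE)
print(hist)
```

With this bucketing, a 64 byte cache line corresponds to bucket 6 and a 4 kB page to bucket 12, so mass to the right of bucket 12 represents parent-child pairs that are at least a page apart.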

8 Depth-First Copying – Good for Parent-Child
[Figure: to-space layout of objects o1 to o15 after depth-first copying.]
- Bad for siblings (o4, o5, o6, o7 are on separate pages)
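The sibling problem can be reproduced with a small sketch. As assumptions, it uses a complete binary tree of 15 objects (object i points to 2i and 2i+1) and pages that hold three objects each; `depth_first_copy_order`, `TREE` and `PAGE_OBJECTS` are hypothetical names, not taken from the talk.

```python
def depth_first_copy_order(root, children):
    """Return object ids in the order a depth-first copy places them."""
    order, seen = [], set()
    stack = [root]
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        order.append(obj)
        # Push children in reverse so the first child is copied next.
        stack.extend(reversed(children.get(obj, ())))
    return order


# Hypothetical heap and page size (assumptions for illustration).
TREE = {i: [2 * i, 2 * i + 1] for i in range(1, 8)}
PAGE_OBJECTS = 3

order = depth_first_copy_order(1, TREE)
page_of = {obj: index // PAGE_OBJECTS for index, obj in enumerate(order)}
print(order)
print({obj: page_of[obj] for obj in (4, 5, 6, 7)})
```

Under these assumptions each parent is followed directly by its first child, but o4, o5, o6 and o7 each land on a different page, which matches the observation on the slide.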

9 Background
[Figure: timeline of copying GC algorithms, repeated from slide 5.]

10 Moon’s Hierarchical Copying GC
- Two scan pointers: scan, partial
[Figure: to-space divided into blocks labelled A to E holding objects o1 to o15, stepping the scan, partial and free pointers through states such as partial = scan and partial = free; part of to-space is marked as re-scanned.]

11 Wilson, Lam & Moher’s Hierarchical Copying GC
- A scan pointer in each block avoids re-scanning.
- Aliasing the scan block to the copy block reduces copy-scan distances.
[Figure: to-space divided into blocks labelled A to E holding objects o1 to o15, with per-block scan pointers scanA to scanE and a free pointer; the steps alternate between scan block = copy block and scan block ≠ copy block.]

12 Background
[Figure: timeline of copying GC algorithms, repeated from slide 5.]

13 Imai and Tick’s Parallel Copying GC
[Figure: to-space blocks and a work pool serving Thread 1, Thread 2, ..., Thread n; threads are shown either with scan block ≠ copy block or with scan block = copy block (aliased).]

14 Recognising the Connection
[Figure: work pool with Thread 1 and Thread 2, shown with scan block ≠ copy block and scan block = copy block (aliased).]
- Wilson, Lam & Moher (WLM): hierarchical, not parallel.
- Imai & Tick: parallel, not hierarchical.
- The immediacy of aliasing in WLM is what distinguishes it from Imai & Tick, so immediate aliasing in Imai & Tick gives hierarchical copying.
- Need to increase aliasing in Imai & Tick to improve locality.

15 Immediate Aliasing
- Check for an aliasing opportunity immediately after each reference slot in each object has been scanned.
- If there is one, interrupt scanning at this point and restart with the aliased block.
- Easier to see via a transition diagram.
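The effect of checking for aliasing after every slot can be seen in a single-threaded sketch. This is not the J9 implementation and it leaves out the work pool and all parallelism; it only models block-structured to-space with one scan pointer per block. The heap shape (a complete binary tree of 15 objects), the block size of three objects, the "oldest block with work" fallback, and all names are assumptions made for illustration.

```python
def hierarchical_copy_order(root, children, block_size):
    """Copy order when the scan block is aliased to the copy block eagerly."""
    to_space = [root]
    forwarded = {root}
    slots_done = {}   # object id -> number of reference slots already scanned
    scan_ptr = {}     # block number -> to-space index of its next unscanned object

    def has_work(block):
        start = scan_ptr.get(block, block * block_size)
        end = min((block + 1) * block_size, len(to_space))
        return start < end

    def pick_block():
        copy_block = len(to_space) // block_size   # block holding the free pointer
        if has_work(copy_block):
            return copy_block                      # alias scan block to copy block
        last = (len(to_space) - 1) // block_size
        for block in range(last + 1):              # else: oldest block with work
            if has_work(block):
                return block
        return None                                # everything is scanned

    block = pick_block()
    while block is not None:
        index = scan_ptr.get(block, block * block_size)
        obj = to_space[index]
        slots = children.get(obj, ())
        done = slots_done.get(obj, 0)
        if done < len(slots):
            child = slots[done]
            slots_done[obj] = done + 1             # scanning may stop mid-object
            if child not in forwarded:
                forwarded.add(child)
                to_space.append(child)
        else:
            scan_ptr[block] = index + 1            # object fully scanned
        block = pick_block()                       # re-check after every slot
    return to_space


# Hypothetical heap (an assumption): complete binary tree of 15 objects.
TREE = {i: [2 * i, 2 * i + 1] for i in range(1, 8)}
order = hierarchical_copy_order(1, TREE, block_size=3)
blocks = [order[i:i + 3] for i in range(0, len(order), 3)]
print(blocks)
```

On this assumed input, each block after the first holds one parent together with both of its children, which is the parent-child and sibling co-location that hierarchical copying aims for.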

16 Parallel Hierarchical – Block State Transitions
[Figure: block state transition diagram. States: freelist, copy, scan, done, scanlist, aliased. Label: shared data.]

17 Parallel Hierarchical – Block State Transitions
[Figure: block state transition diagram. States: freelist, copy, scan, done, scanlist, aliased. Label: shared data.]

18 Parent-Child Distances for Parallel Hierarchical (SPECjbb2005)
[Figure: histogram comparing Breadth-First and Hierarchical. X axis: scanned slot to copied object distance (log2), 1 to 26. Y axis: proportion, 0% to 30%. Markers: 64 byte cache line, page size (4 kB).]
- Fewer TLB misses, fewer L2 cache misses

19 Baseline GC
- IBM J9 JVM; the GC has two generations.
- Parallel copying for the young generation:
  - two semi-spaces
  - most GCs are of this type
- Concurrent mark for the old generation:
  - stop-the-world phase (rare compared to young collections)

20 Results – 26 Benchmark Suite
[Figure: bar chart of % speedups (1 - PH/BF), Y axis from -10% to 25%; heap size 10x min, except SPECjbb2005. Benchmarks: SPECjbb2005, db, javasrc, mtrt, jbytemark, javac, chart, jpat, banshee, javalex, jython, eclipse, mpegaudio, compress, fop, hsqldb, kawa, soot, batik, jack, antlr, jess, ps, bloat, pmd, ipsixql.]
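The speedup metric on this chart, 1 - PH/BF, can be written out as a short sketch. It assumes that PH and BF stand for the run times under parallel hierarchical and breadth-first copying respectively; the transcript does not spell this out, and the numbers below are made up rather than taken from the results.

```python
def speedup_percent(ph_time, bf_time):
    """Percentage speedup of PH over BF, computed as (1 - PH/BF) * 100."""
    if bf_time <= 0:
        raise ValueError("baseline time must be positive")
    return (1.0 - ph_time / bf_time) * 100.0


# Made-up timings in seconds, for illustration only.
print(speedup_percent(ph_time=90.0, bf_time=100.0))    # PH faster: positive
print(speedup_percent(ph_time=105.0, bf_time=100.0))   # PH slower: negative
```

Under this reading, a positive bar means the benchmark ran faster with hierarchical copying and a negative bar means it ran slower.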

21 Results – Scalability (SPECjbb2005)
Platform: Windows 2000 Advanced Server 5.0.2195 SP4; 4x (1.6 GHz HT Pentium 4 Xeon); 256 kB L2 (64 byte cache line), 1 MB L3, 2 GB RAM. Base build: J9 5.0 GA pwi32dev-20051104.
[Figure: throughput (0 to 5) against warehouses (0 to 16) for Hierarchical and Breadth-First.]

22 GC Scaling – SPECjbb2005
Platform: Windows 2000 Advanced Server 5.0.2195 SP4; 4x (1.6 GHz HT Pentium 4 Xeon); 256 kB L2 (64 byte cache line), 1 MB L3, 2 GB RAM. Base build: J9 5.0 GA pwi32dev-20051104.

23 Mutator vs Collector - db
Platform: Linux; 1x (3.06 GHz HT Pentium 4 Xeon); 512 kB L2 (64 byte cache line), 1 GB RAM. Base build: J9 5.0 GA pxi32dev-20051104.
[Figure, panel "Mutator Time": normalized mutator time (1 to 1.5) against heap size relative to minimum heap size (1 to 10), for Hierarchical and Breadth-First.]
[Figure, panel "GC Time": normalized GC time (1 to 3) against heap size relative to minimum heap size (1 to 10), for Hierarchical and Breadth-First.]

24 Cache & TLB Misses - db
Platform: Linux; 1x (3.06 GHz HT Pentium 4 Xeon); 512 kB L2 (64 byte cache line), 1 GB RAM. Base build: J9 5.0 GA pxi32dev-20051104.
[Figure, first panel: normalized mutator L1 cache misses (1 to 1.5) against heap size relative to minimum heap size (1 to 10), for Hierarchical and Breadth-First.]
[Figure, second panel: normalized mutator TLB misses (1 to 1.5) against heap size relative to minimum heap size (1 to 10), for Hierarchical and Breadth-First.]

25 Conclusions
- Introduced a new algorithm that:
  - improves memory locality
  - maintains good scalability
- Two technologies in one: hierarchical decomposition and parallel copying GC.
- Requires no online profiling.
- Evaluated across a wide range of benchmarks:
  - better locality: a dramatic reduction in TLB misses, and fewer L1 misses
  - the cost to the collector is outweighed by the benefit to the mutator
  - the majority of benchmarks show improvements

26 Backup

27 Related Work
[Figure: chart positioning related work. Labels: L1, L2, TLB, Paging; C/C++, Java, Lisp, …; OS, Allocator, Prefetching, Moving GC. Entries: Ch./La ’98, Huang ’04, Shuf ’02 (shown twice), Adl-T. ’04, Lattner ’04, La./Ad. ’05, Ch./Hi. ’01, Cascaval ’05, Moon ’84, Kistler/Fra. ’03, Wi/La/Mo. ’91.]

28 Results – 26 Benchmark Suite – other heap sizes

