Improving Locality with Parallel Hierarchical Copying GC | ISMM’06, Ottawa, Ontario, Canada, June 10th 2006 | © 2006 IBM Corporation

Presentation transcript:

Improving Locality with Parallel Hierarchical Copying GC
David Siegwart, IBM Software Group
Martin Hirzel, IBM Watson Research Center
ISMM’06, Ottawa, Ontario, Canada, June 10th 2006. © 2005, 2006 IBM Corporation

Talk Summary
• Motivation
• Background & Related Work
• Hierarchical Copying GC, Parallelized
• Evaluation across wide range of benchmarks
• Conclusions

Motivation
• Improving Locality:
  – Commercial workloads spend 45% of their time stalled on memory requests. [Adl-Tabatabai et al, PLDI’04 - SPECjbb2000 on Itanium II]
  – Object order in memory influences misses.
  – Copying GC can relocate objects, changing object ordering.
  – Objective: co-locate objects that are used together, on the same page or cache line.
• Maintaining Scalability:
  – parallelism and workload balancing are essential for server workloads

Related Objects are Used Together
• Looked at consecutive field accesses:
  – siblings
  – child-parent
• For SPECjbb2005:
  – 29% siblings
  – 14% child-parent
• For a Trade6 primitive (J2EE benchmark):
  – 36% siblings
  – 8% child-parent
• Copying GC should have:
  – good locality for siblings
  – good locality for child-parent.
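One way such percentages could be gathered is sketched below. This is only an illustration under stated assumptions, not the measurement method used for the talk: the helper names `classify` and `access_pair_proportions`, the single-parent `parent` map, and the access trace are all hypothetical.

```python
from collections import Counter

def classify(a, b, parent):
    """Classify one pair of consecutively accessed objects."""
    if parent.get(a) == b or parent.get(b) == a:
        return "child-parent"
    if a != b and parent.get(a) is not None and parent.get(a) == parent.get(b):
        return "siblings"
    return "other"

def access_pair_proportions(trace, parent):
    """Fraction of consecutive access pairs in each category (needs >= 2 accesses)."""
    pairs = list(zip(trace, trace[1:]))
    counts = Counter(classify(a, b, parent) for a, b in pairs)
    return {kind: counts[kind] / len(pairs)
            for kind in ("siblings", "child-parent", "other")}

# Hypothetical heap: complete binary tree o1..o15, parent of o_i is o_(i // 2).
parent = {i: i // 2 for i in range(2, 16)}
trace = [4, 5, 2, 4, 9, 12]  # made-up access trace, not measured data
proportions = access_pair_proportions(trace, parent)
print(proportions)
```

A real heap is a graph in which an object can have several referrers, so an actual profiler would need a richer notion of "parent" than the single-parent map assumed here.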

Background
[Diagram: lineage of copying collectors. Nodes: Cheney, Moon, Wilson/Lam/Moher, Halstead, Imai/Tick, Parallel Hierarchical. Edge labels: + parallel, + load balancing, + hierarchical, – rescanning.]

Cheney Copying GC – Good for Siblings
[Figure: object tree o1 to o15 copied into to-space by a breadth-first scan, shown at successive positions of the scan and free pointers. Legend: parent, child, free, copied, copied & scanned.]
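The breadth-first behaviour can be reproduced with a small simulation. The sketch below assumes that the tree on the slide is a complete binary tree of 15 unit-size objects (the transcript does not preserve the figure) and models to-space as a list, with `scan` as an index and the free pointer as the list length. The function name is illustrative.

```python
def cheney_copy_order(children, root):
    """Return objects in the order Cheney's algorithm places them in to-space."""
    to_space = [root]      # list index = to-space position; free pointer = len(to_space)
    forwarded = {root}     # objects that already have a to-space copy
    scan = 0
    while scan < len(to_space):              # stop when scan catches up with free
        for child in children.get(to_space[scan], ()):
            if child not in forwarded:
                forwarded.add(child)
                to_space.append(child)       # copy at free, then bump free
        scan += 1
    return to_space

# Assumed heap shape: complete binary tree, o_i has children o_(2i) and o_(2i+1).
children = {i: [c for c in (2 * i, 2 * i + 1) if c <= 15] for i in range(1, 16)}
order = cheney_copy_order(children, 1)
print(order)
```

Siblings such as o4 to o7 end up adjacent, while a parent and its children drift further apart with every level of the tree.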

Cheney Copying GC – Bad for Parent-Child (SPECjbb2005)
[Chart: proportion of references (0% to 30%) against scanned-slot-to-copied-object distance (log2). Series: Cheney (Breadth First). Markers: 64 byte cache line, page size (4 kB).]
– Increases working set, hence TLB misses and L2 cache misses
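A histogram of this kind can be computed from a copy order as in the sketch below. The 16-byte object size, the 15-object complete binary tree, and the function name are assumptions chosen for illustration; the chart on the slide comes from SPECjbb2005, not from this toy heap.

```python
from collections import Counter

def parent_child_distance_histogram(order, children, object_size=16):
    """Count parent-to-child references by floor(log2(byte distance in to-space))."""
    address = {obj: index * object_size for index, obj in enumerate(order)}
    histogram = Counter()
    for parent, kids in children.items():
        for kid in kids:
            distance = abs(address[kid] - address[parent])
            histogram[distance.bit_length() - 1] += 1   # floor(log2(distance))
    return dict(sorted(histogram.items()))

# Assumed heap: complete binary tree o1..o15; breadth-first copying of this
# tree places the objects in the order o1, o2, ..., o15.
children = {i: [c for c in (2 * i, 2 * i + 1) if c <= 15] for i in range(1, 16)}
breadth_first_order = list(range(1, 16))
hist = parent_child_distance_histogram(breadth_first_order, children)
print(hist)
```

Buckets at or above 6 correspond to distances of at least 64 bytes, so those parent-child pairs cannot share a 64 byte cache line.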

Depth-First Copying – Good for Parent-Child
[Figure: object tree o1 to o15 copied into to-space in depth-first order.]
– Bad for Siblings (o4, o5, o6, o7 are on separate pages)
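The sibling problem is visible in a short simulation. The sketch assumes a complete binary tree of 15 unit-size objects and a page that holds 3 objects; both are illustrative choices, since the figure itself is not preserved in the transcript.

```python
def depth_first_copy_order(children, root):
    """Return objects in the order a depth-first (pre-order) copy places them."""
    order, seen, stack = [], set(), [root]
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        order.append(obj)
        # Push children reversed so that the first child is copied first.
        stack.extend(reversed(children.get(obj, ())))
    return order

# Assumed heap: complete binary tree, o_i has children o_(2i) and o_(2i+1).
children = {i: [c for c in (2 * i, 2 * i + 1) if c <= 15] for i in range(1, 16)}
order = depth_first_copy_order(children, 1)

PAGE_OBJECTS = 3  # assumed page capacity, in objects
page_of = {obj: index // PAGE_OBJECTS for index, obj in enumerate(order)}
sibling_pages = [page_of[o] for o in (4, 5, 6, 7)]
print(order)
print(sibling_pages)
```

With these assumptions o4 lands on the same page as its parent o2, but o4, o5, o6 and o7 all land on different pages.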


Moon’s Hierarchical Copying GC
• Two scan pointers: scan, partial
[Figure: to-space divided into blocks A to E holding o1 to o15, shown at successive steps with the free, scan and partial pointers. States shown include partial = scan and partial = free. Some objects are re-scanned.]

Wilson, Lam & Moher’s Hierarchical Copying GC
• scan pointer in each block: avoids re-scanning
• aliasing scan block to copy block reduces copy-scan distances
[Figure: to-space divided into blocks A to E holding o1 to o15, with one scan pointer per block (scanA to scanE) and a free pointer, shown at successive steps. Steps are labelled either scan block = copy block or scan block ≠ copy block.]


Imai and Tick’s Parallel Copying GC
[Figure: to-space blocks and a shared work pool serving Thread 1, Thread 2, ..., Thread n. For each thread either scan block ≠ copy block, or scan block = copy block (aliased).]

Recognising the Connection
[Figure: work pool with Thread 1 (scan block ≠ copy block) and Thread 2 (scan block = copy block, aliased).]
• Wilson, Lam & Moher: hierarchical, not parallel
• Imai & Tick: parallel, not hierarchical
• The immediacy of aliasing in WLM is what distinguishes it from Imai and Tick. So immediate aliasing in Imai & Tick gives hierarchical copying.
• Need to increase aliasing in Imai & Tick to improve locality.

Immediate Aliasing
• Check for aliasing opportunity immediately after each reference slot in each object has been scanned.
• Interrupt scanning at this point, and restart with the aliased block.
• Easier to see via transition diagram.
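The effect of checking after every slot can be seen in a single-threaded simulation. This is a sketch under stated assumptions, not the J9 implementation: objects are unit-size, a block holds 3 objects, the heap is a complete binary tree of 15 objects, and there is no work pool or synchronisation. The function and variable names are illustrative.

```python
def hierarchical_copy_order(children, root, block_size=3):
    """Copy order when the scanner re-checks for aliasing after every slot."""
    to_space = [root]     # list index = to-space position; free = len(to_space)
    forwarded = {root}
    next_slot = [0]       # per copied object: next reference slot to scan
    block_scan = {}       # per block: first object that is not fully scanned

    def unscanned(block):
        """Advance the block's scan pointer; return an index with work, or None."""
        i = block_scan.get(block, block * block_size)
        end = min((block + 1) * block_size, len(to_space))
        while i < end and next_slot[i] == len(children.get(to_space[i], ())):
            i += 1
        block_scan[block] = i
        return i if i < end else None

    while True:
        copy_block = len(to_space) // block_size   # block containing free
        i = unscanned(copy_block)                  # prefer scan block = copy block
        if i is None:                              # otherwise resume an older block
            for block in range(copy_block):
                i = unscanned(block)
                if i is not None:
                    break
        if i is None:
            return to_space                        # nothing left to scan
        # Scan exactly one reference slot, then loop to re-check for aliasing.
        child = children[to_space[i]][next_slot[i]]
        next_slot[i] += 1
        if child not in forwarded:
            forwarded.add(child)
            to_space.append(child)
            next_slot.append(0)

children = {i: [c for c in (2 * i, 2 * i + 1) if c <= 15] for i in range(1, 16)}
order = hierarchical_copy_order(children, 1)
blocks = [order[i:i + 3] for i in range(0, len(order), 3)]
print(blocks)
```

Under these assumptions every block holds a parent together with its two children, which is the hierarchical decomposition that breadth-first and depth-first orders each only partly achieve.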

Parallel Hierarchical – Block State Transitions
[Diagram: block states freelist, copy, scan, aliased, scanlist and done, with transitions between them. Legend: shared data.]


Parent-Child Distances for Parallel Hierarchical (SPECjbb2005)
[Chart: proportion of references (0% to 30%) against scanned-slot-to-copied-object distance (log2). Series: Breadth-First, Hierarchical. Markers: 64 byte cache line, page size (4 kB).]
– fewer TLB misses, fewer L2 cache misses

Baseline GC
• IBM J9 JVM, GC has two generations:
• Parallel copying for the young generation:
  – two semi-spaces
  – most GCs are of this type.
• Concurrent mark for the old generation:
  – stop-the-world phase (rare, compared to young collection).

Results – 26 Benchmark Suite
[Chart: % speedup (1 - PH/BF) per benchmark; heap size 10x min, except SPECjbb2005. Benchmarks: SPECjbb2005, db, javasrc, mtrt, jbytemark, javac, chart, jpat, banshee, javalex, jython, eclipse, mpegaudio, compress, fop, hsqldb, kawa, soot, batik, jack, antlr, jess, ps, bloat, pmd, ipsixql.]
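The speedup metric on the chart axis, 1 - PH/BF, can be written out as below. The sketch assumes PH and BF are run times of the same benchmark under the parallel hierarchical and breadth-first collectors; the function name and the sample numbers are hypothetical, not results from the talk.

```python
def speedup_percent(ph_time, bf_time):
    """Speedup of parallel hierarchical (PH) over breadth-first (BF): 1 - PH/BF, in %."""
    if bf_time <= 0:
        raise ValueError("bf_time must be positive")
    return (1.0 - ph_time / bf_time) * 100.0

# Hypothetical run times in seconds, for illustration only.
faster = speedup_percent(ph_time=90.0, bf_time=100.0)    # PH is faster: positive
slower = speedup_percent(ph_time=105.0, bf_time=100.0)   # PH is slower: negative
print(round(faster, 2), round(slower, 2))
```

A positive value means the benchmark ran faster with hierarchical copying; a negative value is a slowdown.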

Results – Scalability (SPECjbb2005)
Platform: Windows 2000 Advanced Server SP4, 4x(1.6GHz HT Pentium 4 Xeon), 256kB L2 (64byte cache line), 1MB L3, 2GB RAM. Base build: J9 5.0 GA pwi32dev.
[Chart: throughput against number of warehouses. Series: Hierarchical, Breadth-First.]

GC Scaling – SPECjbb2005
Platform: Windows 2000 Advanced Server SP4, 4x(1.6GHz HT Pentium 4 Xeon), 256kB L2 (64byte cache line), 1MB L3, 2GB RAM. Base build: J9 5.0 GA pwi32dev.

Mutator vs Collector - db
Platform: Linux, 1x(3.06 GHz HT Pentium 4 Xeon), 512kB L2 (64byte cache line), 1GB RAM. Base build: J9 5.0 GA pxi32dev.
[Chart "Mutator Time": normalized mutator time against heap size relative to minimum heap size. Series: Hierarchical, Breadth-First.]
[Chart "GC Time": normalized GC time against heap size relative to minimum heap size. Series: Hierarchical, Breadth-First.]

Cache & TLB Misses - db
Platform: Linux, 1x(3.06 GHz HT Pentium 4 Xeon), 512kB L2 (64byte cache line), 1GB RAM. Base build: J9 5.0 GA pxi32dev.
[Chart: normalized mutator L1 cache misses against heap size relative to minimum heap size. Series: Hierarchical, Breadth-First.]
[Chart: normalized mutator TLB misses against heap size relative to minimum heap size. Series: Hierarchical, Breadth-First.]

Conclusions
• Introduced a new algorithm:
  – Improves memory locality
  – Maintains good scalability
• Two technologies in one: hierarchical decomposition and parallel copying GC.
• Requires no online profiling.
• Evaluated across wide range of benchmarks:
  – better locality: dramatic reduction in TLB misses, and also fewer L1 misses.
  – cost on collector outweighed by benefit to mutator.
  – majority of benchmarks show improvements.

Backup

Related Work
[Diagram: related work arranged by memory-hierarchy level (L1, L2, TLB, Paging), language (C/C++, Java, Lisp, …) and mechanism (OS, Allocator, Prefetching, Moving GC). Work shown: Ch./La. ’98, Huang ’04, Shuf ’02, Shuf ’02, Adl-T. ’04, Lattner ’04, La./Ad. ’05, Ch./Hi. ’01, Cascaval ’05, Moon ’84, Kistler/Fra. ’03, Wi/La/Mo. ’91.]

Results – 26 Benchmark Suite – other heap sizes