Parallel, Incremental, and Mostly Concurrent GC
Yoav Ossia
IBM Haifa Research Laboratory, November 2002

Motivation
- Modern SMP servers introduce
  - Higher levels of true parallelism
  - Multi-gigabyte heaps
  - Multi-threaded applications, which must ensure fast response times
- New demands from GC
  - Short pause times on large heaps
  - Minimal throughput hit
  - Scalability on multi-processor hardware
  - Efficient algorithms for weak-ordering hardware (we will not talk about this...)
- Workarounds, which do not work...
  - Bigger heaps
  - Object pooling

Outline
- Introduction
- Principles of the concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Coding issues
- Results (highlights)
- Recent work

Goals of this lecture
- Present the ideas and algorithms
  - As done in the IBM JVM
- Introduce the concerns of implementation
  - The devil's in the details
  - What's cheap and what's expensive
  - How this gets into the design
- Concurrent coding, the real story
  - Difficulties, and how to avoid them

Mark Sweep Compact GC (MSC)
- Mark: traces all reachable (live) objects in the heap (see the sketch below)
  - Needed data structures and operations:
    - Mark stack (push & pop of references to objects)
    - SetMarked(Object *obj) & boolean IsMarked(Object *obj)
  - Start from all roots (threads' stacks, system classes, etc.): mark, and push to the mark stack
  - Process the mark stack until empty: pop Obj, and trace it (mark and push all unmarked references from Obj)
- Sweep
  - Coalesce unmarked objects into free chunks
  - Create a list of free chunks
- Compact (the disaster zone)
  - Usually done when an allocation request cannot be satisfied
  - Move and coalesce the live objects, to create bigger free chunks
  - Usually very long, and unattractive
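As a concrete reading of the mark description, here is a minimal Java sketch; the Obj class, its marked flag, and the roots list are hypothetical stand-ins for the JVM's real heap structures, not IBM's actual code.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Minimal mark phase: start from the roots, then pop-and-trace until
// the mark stack is empty. Obj and its fields are illustrative only.
class MarkSketch {
    static class Obj {
        boolean marked;
        List<Obj> references = List.of();
    }

    static void mark(List<Obj> roots) {
        Deque<Obj> markStack = new ArrayDeque<>();
        for (Obj root : roots) {            // mark and push all roots
            if (root != null && !root.marked) {
                root.marked = true;
                markStack.push(root);
            }
        }
        while (!markStack.isEmpty()) {      // process until empty
            Obj obj = markStack.pop();
            for (Obj child : obj.references) {
                if (child != null && !child.marked) {
                    child.marked = true;    // SetMarked
                    markStack.push(child);
                }
            }
        }
    }
}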

The MSC Process
[Diagram: heap snapshots through Mark, Sweep, and Compact, with roots from stacks and globals; legend: Used, Marked, Free, Dark matter, Unmovable]

MSC Subtle Issues
- Mark
  - Proportional to the amount of live objects only
  - Risk of mark stack overflow (e.g., a linked list with Next as the last field)
  - Avoid paging and cache misses
- Sweep
  - Walks all of the heap again
  - The naive method is proportional to the amount of live AND dead objects
  - Partially solved by a mark bit-vector (see the sketch below)
    - One bit per basic heap unit (typically 8 bytes); each object is mapped to a specific bit
    - Walk the mark bit-vector, and inspect the heap only where big holes appear in the vector
- Compact
  - Tradeoff: level of compaction vs. speed
  - If the GC is not type-accurate, not all objects can be moved
    - Can't tell if a slot on the stack is a reference or a numeric value
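One way to picture the bit-vector sweep is the sketch below, which coalesces runs of clear bits into free chunks and leaves small holes as dark matter; the unit size, minimum chunk size, and free-list representation are invented for illustration.

import java.util.ArrayList;
import java.util.List;

// Sweep over the mark bit-vector: one bit per 8-byte unit. Runs of
// clear bits become free chunks; holes below MIN_FREE stay dark matter.
class SweepSketch {
    static final int UNIT = 8;          // bytes per mark bit (assumed)
    static final int MIN_FREE = 512;    // smallest hole worth keeping (assumed)

    // Returns [offset, length] pairs describing the free list.
    static List<long[]> sweep(boolean[] markBits) {
        List<long[]> freeList = new ArrayList<>();
        int i = 0;
        while (i < markBits.length) {
            if (markBits[i]) { i++; continue; }
            int start = i;              // start of a run of unmarked units
            while (i < markBits.length && !markBits[i]) i++;
            long bytes = (long) (i - start) * UNIT;
            if (bytes >= MIN_FREE) {    // keep only the big holes
                freeList.add(new long[] { (long) start * UNIT, bytes });
            }
        }
        return freeList;
    }
}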

Parallel STW MSC on an N-Way
- Usually one GC thread per processor
- Parallel mark
  - Synchronized marking
  - Load balancing needed (overflow, starvation)
    - Separate mark stacks for each thread
    - Stealing from designated "private" areas attached to the mark stacks (Endo et al.)
    - Direct access to the mark stacks of other threads (Flood et al.)
    - Adding a single shared stack (Cheng & Blelloch)
- Parallel sweep
  - Heap divided into M areas, where M > f * N
  - Synchronization needed for area selection and free-list concatenation
- Compact
  - Tradeoff: parallelism vs. auxiliary data structures

Outline
- Introduction
- Principles of the concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Coding issues
- Results (highlights)
- Recent work

The Concurrent Collection Principle
- STW pauses may reach seconds
  - The cost of the mark phase is dominant
- Marking may be done while the program is active, except that the object graph keeps changing...
- Correctness is kept by use of a write barrier
  - Activated at each change of a reference field in an object (gray it!)
  - Its functionality is algorithm-dependent, and sometimes expensive

The Concurrent Collection: What, Who, How
- Mostly concurrent MSC
  - Tracing done while mutator threads are active (Boehm et al., Printezis & Detlefs)
  - Retrace (clean) may be done while mutator threads are active
  - Short final STW: the last clean and its resulting tracing, and the sweep
  - Originally done by a separate thread/processor ("real" concurrent)
- Incremental work
  - Tracing done incrementally by the mutator threads
  - First done on a copying collector (Baker)
- Parallel execution
  - The concurrent phase is also parallel
  - Many threads can do concurrent work simultaneously

The IBM Mostly Concurrent GC
- Puts together all the existing elements
  - The first production-level parallel, incremental, mostly concurrent MSC collector
  - Combines incremental and concurrent tracing
    - Efficient concurrent marking that terminates on time
  - A new mechanism for parallel load balancing
    - Especially fit for a dynamic number of participating threads
- When compared to a mature, industry-quality GC:
  - Drastic reduction in pause time (more than 75%)
  - Small throughput hit (~10%)

Phases of the Collector
- Concurrent phase
  - Tracing of all reachable objects
    - Done incrementally by the Java mutators and by dedicated low-priority tracing threads
  - The write barrier records changes per region (card) in a card table
    - Any change of a reference in an object dirties the card
    - All black objects in the card are changed to gray
    - A fast and relatively cheap operation (2%-6% throughput hit)
  - A single card-cleaning pass
    - In each dirty card, retrace all marked objects
    - Cleaning may precede the actual tracing
- Final STW phase
  - Root scanning and a final card-cleaning pass
  - Tracing of all additional objects
  - Parallel sweep (since replaced by concurrent sweep)

CPU Distribution
[Chart: CPU distribution over the collection cycle; the graphic is not preserved in the transcript]

Write Barrier
- Activated by the JVM on each reference change done in Java
- Writes into a card table (see the sketch below)
  - Each card covers 512 bytes of heap
  - Cleaning (concurrent or final) may happen at any time
- Example: Foo.a = O1
  - Store O1 in a root (guaranteed to be reachable)
  - Set Foo.a to O1
  - Activate the write barrier on Foo
    - Dirty the entry of Foo in the card table
  - Remove O1 from the root
- An object may span many cards
  - It is usually mapped to the card where its header starts
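A hedged sketch of the card-marking logic follows; the real barrier lives inside the VM, and the names here (CARD_SHIFT, cardTable, heapBase) are assumptions made for illustration.

// Card-marking write barrier: 512-byte cards, one byte per card.
class CardTableSketch {
    static final int CARD_SHIFT = 9;    // 2^9 = 512 bytes per card
    final long heapBase;
    final byte[] cardTable;

    CardTableSketch(long heapBase, long heapSize) {
        this.heapBase = heapBase;
        this.cardTable = new byte[(int) (heapSize >>> CARD_SHIFT) + 1];
    }

    // Called after each reference store; dirties the card holding the
    // header of the object that was written into (Foo in the example).
    void writeBarrier(long objAddress) {
        cardTable[(int) ((objAddress - heapBase) >>> CARD_SHIFT)] = 1;
    }
}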

Outline
- Introduction
- Principles of the concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Coding issues
- Results (highlights)
- Recent work

The Problem of Punctual Termination
- A traditional STW collection starts when the heap is full
  - This produces the minimal number of GCs
- Mostly concurrent aims at completing the concurrent marking just when the heap becomes full
  - If the heap fills up before concurrent marking terminates, the rest of the marking moves into the final phase (a longer pause)
  - If concurrent marking terminates before the heap is filled, choose the lesser of two evils:
    - Wait for the heap to fill (and accumulate more dirty cards)
    - Initiate an "early" GC (more GCs, with all their additional costs)
- Concurrent marking should therefore be adaptive

Combining Concurrent and Incremental
- Existing approaches
  - Incremental
    - Mutators do tracing proportional to their allocation (the tracing rate)
    - Tracing is guaranteed to terminate (more or less) on time
    - Decreases application performance
  - Specialized GC threads (concurrent)
    - More efficient, better CPU utilization
    - The tracing rate is determined by the ratio between GC and program threads, and usually not changed by the collector
    - No control over the termination point
- Hybrid usage of both ways
  - Low-priority background threads fully utilize CPU idle time
    - Not controlled by the tracing rate
  - Mutators perform incremental tracing, to ensure proper termination
    - Only if the tracing goals are not met by the background threads
  - Control "milestones" (concurrent start, card-cleaning pass start, etc.)

Metering Formulas
- Kickoff point of the concurrent phase
  - User-specified tracing rate (TR)
  - Live objects estimate (L_est), dirty objects estimate (M_est)
  - Start concurrent tracing when free space drops below (L_est + M_est) / TR
    - So the tracing rate, applied to the remaining allocations, matches the tracing work
- Calculating the work (see the worked example below)
  - Amount of concurrently traced objects (Traced)
  - Amount of remaining free memory (Free)
  - Estimated background-thread tracing rate (B_est)
    - The ratio between the total amounts of background tracing and allocation
  - Dynamically recalculate the actual tracing rate: ATR = (L_est + M_est - Traced) / Free
    - The ratio between the remaining work and the free space
  - Account for work done in the background: ATR2 = ATR - B_est
    - Mutators trace only if background tracing lags
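A small numeric sketch of these formulas; all sizes and rates below are invented for illustration.

// Numeric sketch of the metering formulas (all values invented).
class MeteringSketch {
    public static void main(String[] args) {
        double TR    = 8.0;    // user-specified tracing rate
        double Lest  = 300e6;  // estimated live bytes (L_est)
        double Mest  = 20e6;   // estimated dirty bytes to retrace (M_est)
        // Kick off concurrent marking once free space drops below:
        double kickoff = (Lest + Mest) / TR;            // 40 MB here

        double Traced = 120e6; // bytes traced concurrently so far
        double Free   = 25e6;  // remaining free bytes
        double Best   = 5.0;   // background tracing rate estimate (B_est)
        // Remaining work over remaining free space...
        double ATR  = (Lest + Mest - Traced) / Free;    // = 8.0
        // ...minus what the background threads are expected to cover.
        double ATR2 = ATR - Best;                       // mutators trace at 3.0
        System.out.printf("kickoff=%.0f ATR=%.1f ATR2=%.1f%n", kickoff, ATR, ATR2);
    }
}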

Behavior Patterns
[Chart: CPU usage over time, distinguishing parallel STW pauses, Java mutation, incremental tracing, and background tracing. Annotated configurations: STW MSC GC, throughput 100%; concurrent, tracing rate 3, throughput 80%; tracing rate 8, throughput 90%; tracing rate 8 at 80% CPU, throughput 95%; tracing rate 8 at 50% CPU, throughput 110%]

Outline
- Introduction
- Principles of the concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Coding issues
- Results (highlights)
- Recent work

Load Balancing: the Problem
- Requirements
  - Even distribution of objects between the parallel tracing threads
  - Avoid mark stack overflow and/or starvation of threads
  - Suitable for an unknown number of collection threads
  - Efficient synchronization
  - Supply simple termination detection
- Existing approaches
  - All use separate mark stacks for each thread
  - Stealing from designated "private" areas attached to the mark stacks (Endo et al.)
  - Direct access to the mark stacks of other threads (Flood et al.)
  - Adding a single shared stack (Cheng & Blelloch)

Load Balancing for the Concurrent Collector
- Pools of WorkPackets
  - Each WorkPacket is a smaller mark stack
  - Cheap get/put synchronization (compare & swap)
  - Separate pools for different occupancies: Full, Non-Full, Non-Empty, Empty
  - Each pool maintains a counter
- A tracing thread uses 2 WorkPackets (see the sketch below)
  - Objects are popped from the input WP
  - Newly marked objects are pushed to the output WP
  - An emptied input WP is returned to the "Empty" pool, and a new "as-full-as-possible" WP is fetched
  - A full output WP is returned to the "Full" pool, and a new "as-empty-as-possible" WP is fetched
- A different object graph traversal
  - BFS, limited by the capacity of a WorkPacket
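The sketch below captures the packet-swapping discipline with just two pools (empty and non-empty) instead of the four occupancy-sorted pools above; the capacity, pool structure, and childrenOf stub are simplifications, not the collector's real code.

import java.util.ArrayDeque;
import java.util.concurrent.ConcurrentLinkedDeque;

// Simplified WorkPacket scheme: fixed-capacity mini mark stacks exchanged
// through shared pools. (The real collector keeps four occupancy-sorted
// pools with cheap compare-and-swap get/put; concurrent deques stand in.)
class WorkPacket {
    static final int CAPACITY = 512;
    final ArrayDeque<Object> slots = new ArrayDeque<>(CAPACITY);
    boolean isFull()  { return slots.size() >= CAPACITY; }
    boolean isEmpty() { return slots.isEmpty(); }
}

class WorkPacketTracer {
    final ConcurrentLinkedDeque<WorkPacket> emptyPool = new ConcurrentLinkedDeque<>();
    final ConcurrentLinkedDeque<WorkPacket> nonEmptyPool = new ConcurrentLinkedDeque<>();

    // One tracing thread's loop: pop from the input packet, push newly
    // marked children to the output packet, and swap packets with the
    // pools as they drain or fill.
    void traceLoop() {
        WorkPacket input = nonEmptyPool.poll();
        WorkPacket output = new WorkPacket();
        while (input != null) {
            while (!input.isEmpty()) {
                Object obj = input.slots.pop();
                for (Object child : childrenOf(obj)) {   // mark test elided
                    if (output.isFull()) {
                        nonEmptyPool.add(output);        // return full output
                        WorkPacket next = emptyPool.poll();
                        output = (next != null) ? next : new WorkPacket();
                    }
                    output.slots.push(child);
                }
            }
            emptyPool.add(input);                        // input drained
            input = nonEmptyPool.poll();                 // fetch a fuller one
        }
        // Real termination detection checks that ALL packets sit in the
        // Empty pool, not merely that this thread found no input.
        if (!output.isEmpty()) nonEmptyPool.add(output);
    }

    // Hypothetical stub for "references held by obj".
    Iterable<Object> childrenOf(Object obj) { return java.util.List.of(); }
}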

Advantages of WorkPackets
- Fair competition when input is scarce
  - All threads get the same chance at tracing input
- Simple detection of the tracing state
  - Overflow: all packets are full
    - Scalability is possible; simply allocate more WPs
  - Starvation: only empty WPs are available, but not all WPs are in the "Empty" pool
  - Termination: all WPs are in the "Empty" pool
- Positive results measured
  - Low cost of synchronization
  - Fair distribution of work among the threads

Outline
- Introduction
- Principles of the concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Coding issues
- Results (highlights)
- Recent work

Concurrent Code Maintenance
- Extremely difficult to verify:
  - Races between concurrent tracing and the program
  - Races between the concurrent tracers
  - Timing is a major factor
    - A debug version cannot reproduce release bugs
    - Problems surface only occasionally
    - Behavior is machine-dependent
- About 40% of the code is verification code
  - Sanity checks: asserts, consistency checks
  - Logging of collection activity, state, and history
    - A shadow heap, for tracing history
    - A shadow card table, for card state and treatment
  - Code to use the above for printing detailed information

Outline
- Introduction
- Principles of the concurrent collector
- Dividing the concurrent work
- Parallel load balancing mechanism
- Results (highlights)
- Recent work

Comparison with STW GC
- Compared to STW MSC
  - Using IBM's production-level JVM
  - 4-way machines: NT, AIX, and IA64
- Mostly testing SPECjbb
  - Server-side Java, throughput-driven, 60% live objects
- Pause time cut by 75%
  - Mark time cut by 86%
  - Sweep becomes the dominant part
- Throughput hit of 10%

Comparison with STW GC (cont.)
- Also tested pBOB
  - An IBM internal benchmark
  - Fit for a 2.5 GB heap, with low CPU utilization and many threads

Effects of Various Tracing Rates
- Mutator utilization: the amount of Java mutation done during the concurrent phase
- The tracing rate also controls the size of the "per-thread mini-STW"

Effects of Various Tracing Rates
- Floating garbage: marked objects that become unreachable before the final STW phase
- Amount of cards cleaned

References
- A Parallel, Incremental and Concurrent GC for Servers. Ossia, Ben-Yitzhak, Goft, Kolodner, Leikehman, Owshanko. PLDI '02.
- Mostly Parallel Garbage Collection. Boehm, Demers, Shenker. ACM SIGPLAN Notices, 1991.
- On-the-Fly Garbage Collection: An Exercise in Cooperation. Dijkstra, Lamport, Martin, Scholten, Steffens. Communications of the ACM, 1978.
- A Generational Mostly-Concurrent Garbage Collector. Printezis, Detlefs. ISMM 2000.
- And many more...

Outline
- Dividing the concurrent work
- Parallel load balancing mechanism
- Results (highlights)
- Recent work
- Introduction

Concurrent Sweep
- Sweep became the dominant part of the remaining pause time
- Except for the initial allocation needed, the rest of the sweep can be deferred
- Concurrent sweep is done incrementally (see the sketch below)
  - After the final phase, and before the next concurrent collection
  - Work is done on each allocation request
- No additional performance cost
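One possible shape for allocation-driven sweeping, assuming a sweep cursor over the heap and a bounded slice of work per request; all names and sizes here are invented, not the collector's actual scheme.

// Incremental sweep: each allocation first sweeps a bounded slice of the
// not-yet-swept heap, spreading sweep work across allocation requests.
class IncrementalSweepSketch {
    static final long SLICE = 64 * 1024;   // bytes swept per allocation (assumed)
    final long heapEnd;
    long sweepCursor = 0;                  // next un-swept heap offset

    IncrementalSweepSketch(long heapSize) { this.heapEnd = heapSize; }

    long allocate(long size) {
        if (sweepCursor < heapEnd) {       // pay a slice of sweep work first
            long sliceEnd = Math.min(sweepCursor + SLICE, heapEnd);
            sweepRange(sweepCursor, sliceEnd);
            sweepCursor = sliceEnd;
        }
        return allocateFromFreeList(size);
    }

    // Stubs: walk the mark bit-vector and manage the free list.
    void sweepRange(long from, long to) { }
    long allocateFromFreeList(long size) { return -1; }
}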

Improving the Low Tracing Rate
- A low tracing rate is more application-friendly
  - More CPU is left to the program
  - Shorter tracing periods are forced on the threads
  - But throughput is reduced
- Goal: improve throughput with a minimal hit on pause times
- Achieved by reducing dirty cards and floating garbage
  - Better performance
  - Reduced heap residency

End