Garbage Collection for Large Scale Multiprocessors (Funded by ANR projects: Prose and ConcoRDanT). Lokesh Gidra, Gaël Thomas, Julien Sopena, Marc Shapiro. Regal-LIP6/INRIA.

Presentation transcript:

Introduction
Why?
– Heavy use of Managed Runtime Environments: application servers and scientific applications (examples: JBoss, Sunflow, etc.).
– Hardware is more and more multi-resourced.
– GC performance is critical.
– Existing GCs were developed for SMPs.
What?
– Assess GC scalability: empirical results.
– Possible factors affecting GC scalability.
– Our approach to fixing them.

Contemporary Architecture
[Figure: two nodes, Node 0 and Node 1, each with cores C0 to C5, L2 and L3 caches, and local DRAM.]
Our machine has 8 such nodes with 6 cores each.
Non-Uniform Memory Access (NUMA): remote access >> local access.

GC Scalability (Lusearch)
[Figure: HotSpot JVM's garbage collectors. Axis labels: Pause Time, GC Threads, Application Threads, Application Time.]
Pause time increases with the number of GC threads: negative scalability!

Trivial Bottleneck
Scalable synchronization primitives are vital.
GC task queue uses a monitor
– Unnecessarily blocks GC threads.
Replaced with a lock-free version.
No barrier for GC threads after GC completion.
Trivial but very important: up to 80% improvement.

Main Bottleneck
Remote access and … remote access!
7 out of 8 accesses are remote
– When scanning an object (87.7% remote)
– When copying an object (82.7% remote)
– When stealing for load balancing (2-4 bus ops/steal)
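The "7 out of 8" figure matches a simple back-of-the-envelope model. The sketch below assumes (this is not stated on the slide) that objects are spread uniformly over the nodes, so a GC thread finds a given object on its own node with probability 1/N. The function name is hypothetical.

```python
def expected_remote_fraction(num_nodes: int) -> float:
    """Expected share of remote accesses, ASSUMING objects are placed
    uniformly at random across num_nodes NUMA nodes."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return (num_nodes - 1) / num_nodes


# The machine in the talk has 8 nodes.
print(f"{expected_remote_fraction(8):.1%}")  # prints 87.5%
```

Under that assumption the model gives 87.5% for 8 nodes, close to the 87.7% measured while scanning.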

Our Approach: Big Picture
Improve GC locality
– Local scan
– Local copy
– Local stealing
Tradeoff:
– Locality vs. load balance
Fix the young generation of ParallelScavenge.

Avoid Remote Access
[Figure: from-space and to-space split across Node 0 and Node 1, holding objects a to f; GC threads GC0 and GC1; a reference queue from node 0 to node 1.]
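One way to read the figure is that each GC thread copies only objects that live on its own node and hands references to remote objects to the owning node through a queue. The sketch below is a single-threaded simulation of that idea, not the HotSpot implementation. The object graph, the `home` map and the function name are hypothetical, and roots are simply enqueued on their home node.

```python
from collections import deque


def numa_local_copy(home, refs, roots, num_nodes):
    """Simulate a copying pass where node n copies only objects homed on n.

    home: object -> node id, refs: object -> list of referenced objects.
    Returns (copied, forwarded): copied[n] lists the objects copied by node n,
    forwarded counts references handed to another node through a queue.
    """
    work = [deque() for _ in range(num_nodes)]
    # ref_queue[src][dst]: references found on src whose target lives on dst
    ref_queue = [[deque() for _ in range(num_nodes)] for _ in range(num_nodes)]
    copied = [[] for _ in range(num_nodes)]
    seen = set()
    forwarded = 0

    for root in roots:
        work[home[root]].append(root)

    progress = True
    while progress:
        progress = False
        for node in range(num_nodes):
            # Pull in references that other nodes queued for this node.
            for src in range(num_nodes):
                while ref_queue[src][node]:
                    work[node].append(ref_queue[src][node].popleft())
            while work[node]:
                obj = work[node].popleft()
                if obj in seen:
                    continue
                seen.add(obj)
                copied[node].append(obj)  # local copy: obj is homed on this node
                progress = True
                for target in refs.get(obj, ()):
                    if home[target] == node:
                        work[node].append(target)
                    else:
                        # Do not touch remote memory; let the owner handle it.
                        ref_queue[node][home[target]].append(target)
                        forwarded += 1
    return copied, forwarded


# Hypothetical graph: a and c on node 0; b, d, e, f on node 1.
home = {"a": 0, "c": 0, "b": 1, "d": 1, "e": 1, "f": 1}
refs = {"a": ["b", "c"], "c": ["d"], "b": ["e"], "e": ["f"]}
print(numa_local_copy(home, refs, ["a"], 2))
```

In this example node 0 copies a and c, node 1 copies b, d, e and f, and only two references cross the node boundary.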

Heap Partitioning
Baseline design: one space of n MB.
NUMA-aware space: two chunks of n/2 MB each. In the figure, chunk 0 is only ¼ full while chunk 1 is full.
Collect when full.
Problem: collections happen more often, because one is triggered as soon as even 1 chunk is full.

Heap Partitioning: Our Approach
[Figure: chunk 0 and chunk 1 together make up n MB.]
Collect when the total reaches n MB.
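The difference between the two trigger policies can be sketched as two predicates. This is only an illustration of the slides' ¼-full example: the function names and the 64 MB figure are hypothetical, and how chunks grow or shrink under the second policy is not described in the slides.

```python
def fixed_chunks_should_collect(chunk_used_mb, chunk_capacity_mb):
    """Fixed n/2 MB chunks: collect as soon as any single chunk is full."""
    return any(used >= chunk_capacity_mb for used in chunk_used_mb)


def total_should_collect(chunk_used_mb, total_capacity_mb):
    """Policy from the slide: collect only when the chunks together hold n MB."""
    return sum(chunk_used_mb) >= total_capacity_mb


n_mb = 64                    # hypothetical young-generation size
used = [n_mb / 8, n_mb / 2]  # chunk 0 is a quarter of n/2 full, chunk 1 is full

print(fixed_chunks_should_collect(used, n_mb / 2))  # True
print(total_should_collect(used, n_mb))             # False
print(sum(used) / n_mb)                             # 0.625
```

With fixed chunks the collection starts with only 62.5% of the n MB in use, which is why it runs more often than the baseline.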

Load Balancing
NUMA-aware work stealing
– A thread only steals from local threads on the same node.
What about inter-node imbalance?
– Apps with a master-slave design cause this.
– Example: h2 database
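The stealing rule and its downside can be sketched as follows. This is a minimal single-threaded model: the names `pick_victim` and `steal`, the random victim choice and the end of the queue that is stolen from are assumptions, not details taken from the slides.

```python
import random
from collections import deque


def pick_victim(thief, node_of, queues, rng=random):
    """Return a thread on the thief's own node that has pending work, or None."""
    candidates = [
        t for t in range(len(queues))
        if t != thief and node_of[t] == node_of[thief] and queues[t]
    ]
    return rng.choice(candidates) if candidates else None


def steal(thief, node_of, queues, rng=random):
    """Move one task from a same-node victim to the thief; None if no victim."""
    victim = pick_victim(thief, node_of, queues, rng)
    if victim is None:
        return None
    task = queues[victim].popleft()
    queues[thief].append(task)
    return task


# Threads 0 and 1 run on node 0, threads 2 and 3 on node 1.
node_of = [0, 0, 1, 1]
queues = [deque(), deque(["x", "y"]), deque(["z"]), deque()]
print(steal(0, node_of, queues))  # x (thread 1 is the only local victim)

# Inter-node imbalance: all work sits on node 0, so node 1's thread stays idle.
print(steal(1, [0, 1], [deque(["a", "b", "c"]), deque()]))  # None
```

The second call shows the tradeoff named on the slide: the thief never touches remote memory, but it also cannot help a node that holds all the work.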

[Figure: from-space and to-space on Node 0 and Node 1 with objects a, b, c and d; GC threads GC0 and GC1; a reference queue from node 0 to node 1; roots on the master's stack and on a slave's stack.]

Conclusion and Future Work
Remote access hinders the scalability of GC.
Tradeoff: locality vs. load balance
– Inter-node imbalance acts as a hurdle.
Using all the cores is sub-optimal
– Hits the memory wall.
Adaptive resizing of the NUMA-aware generation costs more!
Up to 65% on scalable benchmarks of DaCapo.