Assessing the Scalability of Garbage Collectors on Many Cores
(Funded by ANR projects: Prose and ConcoRDanT)
Lokesh Gidra, Gaël Thomas, Julien Sopena, Marc Shapiro
Regal-LIP6/INRIA

Introduction
Why?
– Managed runtime environments (MREs) are ubiquitous!
– The GC is a vital component of an MRE, so its performance is critical.
– Hardware is more and more multi-resourced.
– Are GCs scaling with such hardware?
– Current solutions have not been evaluated on true many-core machines!
What?
– An assessment of GC scalability: empirical results.
– Possible factors affecting GC scalability.

Multi-Node Architecture
[Figure: two nodes, each with cores C0, C1, ..., C5, L2 and L3 caches, a memory controller (MC) and local DRAM, with links to other nodes.]
Our machine has 8 nodes with 6 cores each.
Remote access >> local access.

Parallel Copying Garbage Collection
[Figure: timeline of mutator threads and GC threads showing application time, pause time and total time; heap diagram with From Space and To Space, live objects and dead objects.]
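As a rough illustration of the copying step shown in the figure, here is a single-threaded Python sketch that copies live objects from a from-space into a to-space with a Cheney-style scan pointer. It is a toy under stated assumptions: `Obj`, `collect` and the `forward` field are hypothetical names chosen for this example, and the collector discussed in these slides runs several GC threads in parallel, which the sketch does not model.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison, so cyclic graphs are safe
class Obj:
    name: str
    refs: list = field(default_factory=list)  # outgoing references
    forward: "Obj | None" = None              # forwarding pointer, set once copied


def collect(roots):
    """Copy everything reachable from `roots` into a fresh to-space.

    Returns (new_roots, to_space). Objects never reached are simply
    left behind in the from-space: they are the dead objects.
    """
    to_space = []

    def copy(obj):
        # Copy an object at most once; later visits reuse the forwarding pointer.
        if obj.forward is None:
            obj.forward = Obj(obj.name, list(obj.refs))
            to_space.append(obj.forward)
        return obj.forward

    new_roots = [copy(r) for r in roots]

    # Scan pointer: everything before `scan` already points into to-space.
    scan = 0
    while scan < len(to_space):
        current = to_space[scan]
        current.refs = [copy(child) for child in current.refs]
        scan += 1
    return new_roots, to_space
```

The pause time in the figure corresponds to the time the mutator threads are stopped while a loop like this one runs; the to-space then becomes the heap the application continues to use.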

GC's Effect on Application Scalability (Lusearch)
Setup: mutator threads = GC threads = number of cores (varied).
– Up to 6 cores: 3× performance improvement.
– More than 6 cores: no improvement in total time.
– The proportion of pause time increases, up to 50%.

GC Scalability (Lusearch)
Setup: mutator threads = cores = 48, with a varying number of GC threads.
Pause time increases with the number of GC threads: negative scalability!

1. Remote Scanning
[Figure: From Space and To Space spread across Node 0, Node 1, Node 2 and Node 3, with live objects, dead objects and GC threads GC0, GC1, GC2 and GC3.]
Object allocation: random (default).
87.7% of scans were remote!

2. Remote Copying
[Figure: From Space and To Space spread across Node 0, Node 1, Node 2 and Node 3, with live objects, dead objects and GC threads GC0, GC1, GC2 and GC3.]
82.7% of copies were remote!
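A back-of-the-envelope model helps put the remote-access percentages in context. Assuming (this is an assumption, not something the slides state) that each object lands on a uniformly random node and that the GC thread touching it runs on an independently random node, the expected remote share on an N-node machine is (N - 1) / N. For the 8-node machine described earlier that is 87.5%, close to the measured 87.7% for scans and somewhat above the 82.7% for copies. The function names below are hypothetical and the simulation is only a sanity check of that formula, not the authors' measurement method.

```python
import random


def expected_remote_fraction(num_nodes):
    """Closed form under the uniform, independent placement assumption."""
    return (num_nodes - 1) / num_nodes


def simulated_remote_fraction(num_nodes, num_accesses=100_000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    remote = 0
    for _ in range(num_accesses):
        object_node = rng.randrange(num_nodes)  # node holding the object
        thread_node = rng.randrange(num_nodes)  # node running the GC thread
        if object_node != thread_node:
            remote += 1
    return remote / num_accesses
```

Under this model the remote share only grows with the node count, which is consistent with the slides treating random placement as a scalability bottleneck.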

3. Load Balancing
– Based on the work-stealing technique.
– 1 task queue per GC thread. Owner: push and pop. Other GC threads: steal (pop).
– Shared variable: size (task queue size).
– Highly unbalanced load: requires a lot of stealing.
– Threads keep stealing until all are done.
– Performance impact: ≥ 2-4 cache misses per steal!
– 33.3% improvement in pause time by disabling it!
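The queue discipline on this slide can be sketched as follows. This is a minimal, single-threaded Python simulation under stated assumptions: the `GCThread` class and `run` function are hypothetical names, the owner is assumed to push and pop at one end while thieves pop from the opposite end (a common work-stealing layout, not confirmed by the slides), and tasks do not spawn further tasks. It does not model the cache misses a real steal incurs; it only shows how an unbalanced start forces idle threads to steal.

```python
from collections import deque


class GCThread:
    def __init__(self, tid):
        self.tid = tid
        self.queue = deque()  # one task queue per GC thread
        self.processed = 0
        self.steals = 0

    def push(self, task):
        # Owner side: add work at the tail.
        self.queue.append(task)

    def pop(self):
        # Owner side: take work from the tail, or None if empty.
        return self.queue.pop() if self.queue else None

    def steal_from(self, victim):
        # Thief side: take work from the head of another thread's queue.
        # In a real collector the thief would first read the victim's
        # shared size field, which is where contention arises.
        return victim.queue.popleft() if victim.queue else None


def run(threads):
    """Round-robin simulation: keep going until every queue is empty."""
    while any(t.queue for t in threads):
        for t in threads:
            task = t.pop()
            if task is None:
                # No local work: try each other thread in turn.
                for victim in threads:
                    if victim is not t:
                        task = t.steal_from(victim)
                        if task is not None:
                            t.steals += 1
                            break
            if task is not None:
                t.processed += 1
```

Seeding all tasks on a single thread and calling `run` makes every other thread obtain all of its work through `steal_from`, which mirrors the "highly unbalanced load" case on the slide.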

Conclusion
GC does affect the application's scalability, so it matters!
GC doesn't scale with the hardware!
Bottlenecks:
– Remote Scanning
– Remote Copying
– Load Balancing
Future Work:
– Fix the bottlenecks: does it help the GC to scale?

DaCapo Benchmarks' Scalability

Revisiting Application (Lusearch) Scalability…