Performance Analysis and Optimization of Full GC in Memory-hungry Environments
Yang Yu, Tianyang Lei, Weihua Zhang, Haibo Chen, Binyu Zang
Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University, China
Fudan University, China
VEE 2016
Big-data Ecosystem
JVM-based languages
Memory-hungry environments
- Memory bloat phenomenon in large-scale Java applications [ISMM '13]
- Limited per-application memory in a shared-cluster design inside companies like Google [EuroSys '13]
- Limited per-core memory in many-core architectures (e.g., Intel Xeon Phi)
Effects of Garbage Collection
- GC suffers severe strain: accumulated stragglers [HotOS '15], amplified tail latency [Commun. ACM]
- Where exactly is the bottleneck of GC in such memory-hungry environments?
Parallel Scavenge in a Production JVM – HotSpot
- Default garbage collector in OpenJDK 7 & 8
- Stop-the-world, throughput-oriented
- Heap segregated into multiple areas: young generation, old generation, permanent generation
- Young GC collects the young generation; Full GC collects the whole heap, mainly the old generation
Profiling of PS GC
- GC profiling of data-intensive Java programs from JOlden
- Heap size set close to the workload size to keep memory hungry (a sample run is sketched below)
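As a concrete illustration of such a setup, a run might look like the following. The HotSpot flags are standard OpenJDK 7/8 options; the benchmark jar name and the 1 GB heap sizing are placeholders, not the paper's exact configuration.

```sh
# Select Parallel Scavenge with parallel full GC, pin the heap close to the
# workload size, and emit a per-collection log for the phase decomposition.
# (jolden-workload.jar and the 1 GB sizes are illustrative placeholders.)
java -XX:+UseParallelGC -XX:+UseParallelOldGC \
     -XX:ParallelGCThreads=6 \
     -Xms1g -Xmx1g \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -jar jolden-workload.jar
```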
Full GC of Parallel Scavenge
- A variant of the mark-compact algorithm: slide live objects toward the starting side of the heap
- Two bitmaps mapping the heap; the heap is initially segregated into multiple regions
- Three phases: marking, summary, and compacting (sketched below)
[Figure: two bitmaps mapping the heap]
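To make the three phases concrete, here is a minimal toy model in C++ (the language of HotSpot). A single per-word liveness bitmap stands in for the two object-boundary bitmaps of the real collector, and "objects" are just heap words, so this only illustrates the structure of the algorithm, not the actual implementation.

```cpp
// Toy model of PS full GC's three phases over a heap of words split into regions.
#include <cstddef>
#include <cstdint>
#include <vector>

struct ToyHeap {
    static constexpr size_t kRegionWords = 1024;   // region granularity
    std::vector<uint64_t> words;                   // heap contents
    std::vector<bool>     mark_bitmap;             // marking phase output: 1 bit per live word
    std::vector<size_t>   region_dest;             // summary phase output: new start of each region's live data
};

// Phase 1 (marking) traces the object graph from the roots and sets bits for
// reachable objects; the traversal itself is elided here.

// Phase 2 (summary): a prefix sum of live words tells every region where its
// live data will land once everything slides toward the start of the heap.
void summary(ToyHeap& h) {
    size_t n_regions = h.words.size() / ToyHeap::kRegionWords;
    h.region_dest.assign(n_regions, 0);
    size_t dest = 0;
    for (size_t r = 0; r < n_regions; ++r) {
        h.region_dest[r] = dest;
        for (size_t w = r * ToyHeap::kRegionWords; w < (r + 1) * ToyHeap::kRegionWords; ++w)
            dest += h.mark_bitmap[w] ? 1 : 0;
    }
}

// Phase 3 (compacting): slide live words toward the starting side; the real
// collector also rewrites every reference to point at the new locations.
void compact(ToyHeap& h) {
    size_t dest = 0;
    for (size_t w = 0; w < h.words.size(); ++w)
        if (h.mark_bitmap[w]) h.words[dest++] = h.words[w];
}
```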
Decomposition of Full GC
Update Refs Using Bitmaps
[Figure: updating process for a referenced live object O, from its source location to its destination]
Reference Updating Algorithm
- Calculate the new location that each reference points to (see the sketch below)
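A minimal sketch, under the same toy-heap assumptions as above, of how that new location is computed: the region's destination from the summary phase plus the number of live words that precede the object inside its region, obtained by scanning the mark bitmap from the region start up to the object. That per-reference bitmap scan is what the following slides identify as the bottleneck.

```cpp
// New location = destination of the object's region (summary data)
//              + live words between the region start and the object (bitmap scan).
#include <cstddef>
#include <vector>

struct CompactData {
    static constexpr size_t kRegionWords = 1024;
    std::vector<bool>   mark_bitmap;    // 1 bit per heap word
    std::vector<size_t> region_dest;    // filled in by the summary phase
};

// The costly part: a linear scan whose length grows with the amount of live
// data in front of the object within its region.
static size_t live_words_in_range(const CompactData& d, size_t beg, size_t end) {
    size_t live = 0;
    for (size_t w = beg; w < end; ++w)
        live += d.mark_bitmap[w] ? 1 : 0;
    return live;
}

// Translate the old word index of a referenced object to its post-compaction index.
size_t calc_new_location(const CompactData& d, size_t old_word) {
    size_t region     = old_word / CompactData::kRegionWords;
    size_t region_beg = region * CompactData::kRegionWords;
    return d.region_dest[region] + live_words_in_range(d, region_beg, old_word);
}
```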
Decomposition of Full GC (cont.)
We found the bottleneck: the reference-updating step described on the previous slides.
Solution: Incremental Query
- Key issue: the searching range is repeated when two sequentially searched objects reside in the same region
- Basic idea: reuse the result of the last query (sketched below)
[Figure: last query in region R vs. current queries M, N, Q; when a current query falls in the same region, the last searching range can be reused]
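A sketch of the reuse step, continuing the toy types above: each query remembers where its bitmap scan stopped and how many live words it had seen; if the next query lands in the same region at a higher address, the scan resumes from that point instead of restarting at the region start. Names are illustrative.

```cpp
// Incremental query: resume the previous bitmap scan when the current query
// falls in the same region and does not lie behind the cached end point.
#include <cstddef>
#include <vector>

struct CompactData {
    static constexpr size_t kRegionWords = 1024;
    std::vector<bool>   mark_bitmap;
    std::vector<size_t> region_dest;
};

struct LastQuery {                   // per-GC-thread cache of the previous query
    size_t region    = ~size_t(0);   // region of the previous query (invalid at start)
    size_t end_word  = 0;            // where the previous scan stopped
    size_t live_seen = 0;            // live words counted from the region start to end_word
};

size_t calc_new_location_incremental(const CompactData& d, LastQuery& last, size_t old_word) {
    size_t region     = old_word / CompactData::kRegionWords;
    size_t region_beg = region * CompactData::kRegionWords;

    size_t scan_beg = region_beg;    // default: full scan from the region start
    size_t live     = 0;
    if (region == last.region && old_word >= last.end_word) {
        scan_beg = last.end_word;    // same region, forward query: reuse the last result
        live     = last.live_seen;
    }
    for (size_t w = scan_beg; w < old_word; ++w)
        live += d.mark_bitmap[w] ? 1 : 0;

    last.region    = region;         // remember this query for the next one
    last.end_word  = old_word;
    last.live_seen = live;
    return d.region_dest[region] + live;
}
```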
Caching Types (SPECjbb2015, 1 GB workload, 10 GB heap)
Query Patterns
- Local pattern: sequentially referenced objects tend to lie in the same region, so results of the last queries can easily be reused
- Random pattern: sequentially referenced objects lie in random regions, so last results cannot be reused directly
- Most applications mix the two query patterns in different proportions (see the sketch below)
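Purely as an illustration (not from the paper), one way to quantify those "respective proportions" is the fraction of consecutive queries that land in the same region:

```cpp
// Values near 1 indicate the local pattern, values near 0 the random pattern;
// real workloads sit somewhere in between.
#include <cstddef>
#include <vector>

static constexpr size_t kRegionWords = 1024;   // toy region size

double same_region_fraction(const std::vector<size_t>& queried_words) {
    if (queried_words.size() < 2) return 0.0;
    size_t hits = 0;
    for (size_t i = 1; i < queried_words.size(); ++i)
        if (queried_words[i] / kRegionWords == queried_words[i - 1] / kRegionWords)
            ++hits;
    return static_cast<double>(hits) / static_cast<double>(queried_words.size() - 1);
}
```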
Optimistic IQ (1/3)
- A straightforward implementation that complies with the basic idea: each GC thread maintains one global last-query result for all the regions
- Pros: little overhead in both memory utilization and computation
- Cons: relies heavily on the local pattern to be effective
Sort-based IQ (2/3)
- Dynamically reorder refs with a lazy update (sketched below): references are first filled into a buffer before updating; once the buffer fills up, refs are reordered by region index; the buffer size is close to the L1 cache line size
- Pros: periodically gathers refs in the same region
- Cons: computation overhead from the extra sorting step
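A sketch of the buffering idea on the same toy model; `update_one` stands for the incremental query from the earlier sketch, and the 8-slot buffer is an illustrative stand-in for "close to the L1 cache line size".

```cpp
// Sort-based IQ: batch pending queries into a small buffer, reorder them by
// region index so that same-region queries become adjacent, then drain the
// buffer through the incremental query.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

struct SortBasedIQ {
    static constexpr size_t kRegionWords = 1024;
    static constexpr size_t kBufferSlots = 8;        // 8 x 8-byte entries ~= one 64-byte cache line

    std::function<size_t(size_t)> update_one;        // the incremental query sketched earlier
    std::vector<size_t> buffer;                      // old word indices of buffered references

    void push(size_t old_word) {
        buffer.push_back(old_word);
        if (buffer.size() == kBufferSlots) flush();  // lazy update: only on a full buffer
    }

    void flush() {
        // Reorder by region index so consecutive queries hit the same cached region.
        std::sort(buffer.begin(), buffer.end(), [](size_t a, size_t b) {
            return a / kRegionWords < b / kRegionWords;
        });
        for (size_t old_word : buffer)
            (void)update_one(old_word);              // real code would also rewrite the reference slot
        buffer.clear();
    }
};
```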
Region-based IQ (3/3)
- Maintain the last-query result for each region per GC thread: fits both local and random query patterns (sketched below)
- A more aggressive slicing scheme: divide each region into multiple slices, maintaining the last result for each slice
- Minimize memory overhead: a 16-bit integer stores the calculated size of live objects, and an offset replaces the full-length address of the last queried object; reduced to 0.09% of the heap size with one slice per GC thread
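A sketch of the region-based scheme on the same toy model: one compact cache entry per region (the slicing scheme described above would further subdivide each region, one entry per slice), with a 16-bit offset and a 16-bit live-word count in place of full addresses. Names and sizes are illustrative.

```cpp
// Region-based IQ: a per-region (per-slice) cache entry lets any query reuse
// the most recent scan in its own region, regardless of where the previous
// query landed. Entries are compressed to two 16-bit fields.
#include <cstddef>
#include <cstdint>
#include <vector>

struct SliceCache {
    uint16_t end_offset = 0;   // word offset within the region where the last scan stopped
    uint16_t live_seen  = 0;   // live words counted from the region start up to end_offset
};

struct RegionBasedIQ {
    static constexpr size_t kRegionWords = 1024;   // must fit in the 16-bit fields above

    std::vector<bool>       mark_bitmap;           // 1 bit per heap word
    std::vector<size_t>     region_dest;           // summary phase output
    std::vector<SliceCache> cache;                 // one entry per region, per GC thread

    explicit RegionBasedIQ(size_t n_regions)
        : mark_bitmap(n_regions * kRegionWords, false),
          region_dest(n_regions, 0),
          cache(n_regions) {}

    size_t calc_new_location(size_t old_word) {
        size_t region     = old_word / kRegionWords;
        size_t region_beg = region * kRegionWords;
        SliceCache& c     = cache[region];

        size_t scan_beg = region_beg;              // default: scan from the region start
        size_t live     = 0;
        if (old_word >= region_beg + c.end_offset) {
            scan_beg = region_beg + c.end_offset;  // reuse this region's cached scan
            live     = c.live_seen;
        }
        for (size_t w = scan_beg; w < old_word; ++w)
            live += mark_bitmap[w] ? 1 : 0;

        c.end_offset = static_cast<uint16_t>(old_word - region_beg);
        c.live_seen  = static_cast<uint16_t>(live);
        return region_dest[region] + live;
    }
};
```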
Experimental environments
Parameter              | Intel Xeon CPU E5-2620                                     | Intel Xeon Phi Coprocessor 5110P
Chips                  | 1                                                          | 1
Core type              | Out-of-order                                               | In-order
Physical cores         | 6                                                          | 60
Frequency              | 2.00 GHz                                                   | 1052.63 MHz
Data caches            | 32 KB L1d, 32 KB L1i, 256 KB L2 per core; 15 MB L3 shared  | 32 KB L1, 512 KB L2 per core
Memory capacity        | 32 GB                                                      | 7697 MB
Memory technology      | DDR3                                                       | GDDR5
Memory access latency  | 140 cycles                                                 | 340 cycles
Experimental environments (cont.)
- Benchmarks: JOlden, GCBench, DaCapo, SPECjvm2008, Spark, Giraph (X.v & C.c refer to Xml.validation & Compiler.compiler)
- OpenJDK 7u with the HotSpot JVM
Speedup of Full GC Throughput on CPU
[Chart: comparison of the 3 query schemes and OpenJDK 8 with 1 & 6 GC threads; headline speedups of 1.99x and 1.94x]
Improvement of Application Throughput on CPU
[Chart: 19.3% improvement with 6 GC threads using region-based IQ]
Speedup on Xeon Phi
[Chart: speedup of full GC & application throughput with 1 & 20 GC threads using region-based IQ; highlighted figures 2.22x, 2.08x, and 11.1%]
Reduction in Pause Time
[Chart: normalized elapsed time of full GC & total pause (lower is better); reductions of 31.2% and 34.9%]
Speedup for Big Data on CPU
[Chart: speedup of full GC & application throughput using region-based IQ with varying input and heap sizes]
Conclusions
- A thorough profiling-based analysis of Parallel Scavenge in a production JVM – HotSpot
- An incremental query model and three different schemes
- Integrated into the OpenJDK mainline (JDK-8146987)
Thanks! Questions?
Backups
Port of Region-based IQ to OpenJDK 8
[Chart: speedup of full GC throughput with region-based IQ on JDK 8]
Evaluation on Clusters
- Orthogonal to distributed execution
- A small-scale evaluation on a 5-node cluster, each node with two 10-core Intel Xeon E5-2650 v3 processors and 64 GB DRAM
- Ran Spark PageRank with a 100-million-edge input and a 10 GB heap on each node
- Recorded accumulated full GC time across all nodes and elapsed application time on the master
- 63.8% and 7.3% improvement for full GC and application throughput, respectively
- Smaller speedup because network communication becomes a more dominant factor during distributed execution