1
Performance Analysis and Optimization of Full GC in Memory-hungry Environments
Yang Yu, Tianyang Lei, Weihua Zhang, Haibo Chen, Binyu Zang
Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University, China
Fudan University, China
VEE 2016
2
Big-data Ecosystem
3
JVM-based languages
4
Memory-hungry environments
Memory bloat phenomenon in large-scale Java applications [ISMM '13]
Limited per-application memory in a shared-cluster design inside companies like Google [EuroSys '13]
Limited per-core memory in many-core architectures (e.g., Intel Xeon Phi)
5
Effects of Garbage Collection
GC suffers severe strain
Accumulated stragglers [HOTOS '15]
Amplified tail latency [Commun. ACM]
Where exactly is the bottleneck of GC in such memory-hungry environments?
6
Parallel Scavenge in a Production JVM – HotSpot
Default garbage collector in OpenJDK 7 & 8
Stop-the-world, throughput-oriented
Heap space segregated into multiple areas: young generation, old generation, permanent generation
Young GC to collect the young gen
Full GC to collect all, mainly for the old gen
7
Profiling of PS GC
GC profiling of data-intensive Java programs from JOlden
Set heap size close to workload size to keep memory hungry
8
Full GC of Parallel Scavenge
A variant of the Mark-Compact algorithm: slide live objects towards the starting side
Two bitmaps mapping the heap
Heap initially segregated into multiple regions
Three phases: marking, summary & compacting
(Figure: two bitmaps mapping the heap; a simplified sketch of these structures follows)
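To make the structure concrete, here is a minimal C++ sketch of the data the compacting full GC works with. All type and field names (MarkBitmap, RegionData, CompactionSummary) are illustrative simplifications, not HotSpot's actual classes or layouts.

```cpp
// Minimal illustrative sketch, not HotSpot source: simplified stand-ins for
// the mark bitmap and per-region summary data that full GC maintains.
#include <cstddef>
#include <cstdint>
#include <vector>

using HeapWord = uintptr_t;            // a word-aligned heap address

// One bit per heap word. HotSpot keeps two bitmaps (object-begin and
// object-end bits); a single per-word liveness bitmap stands in for both
// here to keep the sketch short.
struct MarkBitmap {
  std::vector<uint64_t> bits;
  HeapWord heap_base = 0;
  bool is_marked(HeapWord addr) const {
    size_t word = (addr - heap_base) / sizeof(HeapWord);
    return (bits[word / 64] >> (word % 64)) & 1u;
  }
};

// The heap is split into fixed-size regions. The summary phase records,
// for every region, where its live data will be slid to and how much of
// it there is.
struct RegionData {
  HeapWord destination = 0;  // address the region's live data moves to
  size_t   live_words  = 0;  // total live words inside the region
};

struct CompactionSummary {
  std::vector<RegionData> regions;
  size_t   region_size_words = 0;
  HeapWord heap_base = 0;

  size_t region_index(HeapWord addr) const {
    return (addr - heap_base) / (region_size_words * sizeof(HeapWord));
  }
  HeapWord region_start(size_t idx) const {
    return heap_base + idx * region_size_words * sizeof(HeapWord);
  }
};
```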
9
Decomposition of Full GC
10
Update Refs Using Bitmaps
(Figure: updating process for a referenced live object O, from its source region to its destination)
11
Reference Updating Algorithm
Calculate the new location that a reference points to (a sketch of this baseline calculation follows)
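The calculation can be sketched as follows, continuing the simplified types above (names are illustrative, not HotSpot's): the new address is the destination of the object's region plus the amount of live data preceding the object inside that region, obtained by scanning the mark bitmap from the region start. This repeated scan is the work the later slides identify as the key issue.

```cpp
// Continues the simplified sketch above (illustrative names, not HotSpot code).

// Count live words in [beg, end) by walking the bitmap word by word.
// The real collector walks object extents; a plain per-word scan is shown
// here only to make the cost of the repeated search visible.
size_t live_words_in_range(const MarkBitmap& live, HeapWord beg, HeapWord end) {
  size_t n = 0;
  for (HeapWord a = beg; a < end; a += sizeof(HeapWord)) {
    if (live.is_marked(a)) n++;
  }
  return n;
}

// Baseline query: new location of the object a reference points to.
HeapWord calc_new_location(const CompactionSummary& sum,
                           const MarkBitmap& live, HeapWord obj) {
  size_t r = sum.region_index(obj);
  // Every query re-scans the bitmap from the region start up to the object.
  size_t preceding = live_words_in_range(live, sum.region_start(r), obj);
  return sum.regions[r].destination + preceding * sizeof(HeapWord);
}
```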
14
Decomposition of Full GC (cont.)
We found the bottleneck: the new-location queries performed during reference updating!
15
Solution: Incremental Query
Key issue: repeated searching range when two sequentially searched objects reside in the same region
Basic idea: reuse the result of the last query (see the sketch below)
(Figure: the last query in region R searched from beg_addr to last_end_addr; if the current query targets the same region, the previous result can be reused)
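A minimal sketch of that reuse, still using the illustrative types above (this roughly corresponds to the optimistic scheme on a later slide, which keeps one cached result per GC thread): if the current object falls in the same region as the last query and lies after it, only the gap between the two is scanned and the cached count is reused.

```cpp
// Continues the simplified sketch above (illustrative names, not HotSpot code).

// One cached query result; in the straightforward (optimistic) scheme each
// GC thread keeps a single such record shared by all regions.
struct LastQuery {
  bool     valid = false;
  size_t   region = 0;       // region index of the last queried object
  HeapWord last_end = 0;     // address up to which live words were counted
  size_t   live_so_far = 0;  // live words from the region start to last_end
};

HeapWord calc_new_location_iq(const CompactionSummary& sum,
                              const MarkBitmap& live, HeapWord obj,
                              LastQuery& cache) {
  size_t r = sum.region_index(obj);
  HeapWord scan_from = sum.region_start(r);
  size_t counted = 0;
  if (cache.valid && cache.region == r && cache.last_end <= obj) {
    // Same region and the new object lies after the last query:
    // reuse the cached count and scan only the remaining gap.
    scan_from = cache.last_end;
    counted = cache.live_so_far;
  }
  size_t preceding = counted + live_words_in_range(live, scan_from, obj);
  cache = {true, r, obj, preceding};
  return sum.regions[r].destination + preceding * sizeof(HeapWord);
}
```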
16
Caching Types
(Figure: SPECjbb2015, 1 GB workload, 10 GB heap)
17
Query Patterns
Local pattern: sequentially referenced objects tend to lie in the same region, so the results of the last queries can easily be reused
Random pattern: sequentially referenced objects lie in random regions, so the last results cannot be reused directly
Most applications are a mix of the two query patterns, differing only in their proportions
18
Optimistic IQ (1/3)
A straightforward implementation that complies with the basic idea: each GC thread maintains one global last-query result for all the regions (as in the sketch above)
Pros: little overhead in both memory utilization and calculation
Cons: relies heavily on the local pattern to be effective
19
Sort-based IQ (2/3)
Dynamically reorders refs with a lazy update (see the sketch below)
References are first filled into a buffer before updating; once the buffer fills up, the refs are reordered by region index
Buffer size close to the L1 cache line size
Pros: periodically gathers refs that fall in the same region
Cons: calculation overhead due to the extra sorting procedure
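A sketch of the buffering-and-sorting step under the same simplified types; the buffer size, the PendingRef layout, the address tie-break, and the flush_buffer name are assumptions made for illustration, not the paper's exact design.

```cpp
// Continues the simplified sketch above (illustrative names, not HotSpot code).
#include <algorithm>
#include <array>

constexpr size_t kBufSize = 64;  // batch size is an assumption, not the paper's value

struct PendingRef {
  HeapWord* slot;    // the field holding the reference to be updated
  HeapWord  target;  // the object the reference currently points to
};

// Flush a full buffer: reorder the pending refs by region index (with the
// target address as a tie-break) so same-region queries become consecutive
// and the incremental query above can keep reusing its cached result.
void flush_buffer(std::array<PendingRef, kBufSize>& buf, size_t n,
                  const CompactionSummary& sum, const MarkBitmap& live,
                  LastQuery& cache) {
  std::sort(buf.begin(), buf.begin() + n,
            [&](const PendingRef& a, const PendingRef& b) {
              size_t ra = sum.region_index(a.target);
              size_t rb = sum.region_index(b.target);
              return ra != rb ? ra < rb : a.target < b.target;
            });
  for (size_t i = 0; i < n; ++i) {
    *buf[i].slot = calc_new_location_iq(sum, live, buf[i].target, cache);
  }
}
```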
20
Region-based IQ (3/3)
Maintains the result of the last query for each region, per GC thread (see the sketch below)
Fits both local and random query patterns
A more aggressive slicing scheme: divide each region into multiple slices and maintain a last result for each slice
Minimized memory overhead: a 16-bit integer stores the calculated live-object size, and an offset replaces the full-length address of the last queried object
Reduced to 0.09% of the heap size with one slice per GC thread
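A sketch of the per-region cache under the same simplified types; the exact field widths and the one-entry-per-region layout (slicing omitted) are illustrative assumptions.

```cpp
// Continues the simplified sketch above (illustrative names, not HotSpot code).

// Per-thread cache with one entry per region (slices omitted for brevity).
// The compact encoding follows the slide: a 16-bit live-word count and a
// 16-bit word offset instead of a full-length address.
struct RegionCacheEntry {
  uint16_t live_words  = 0;  // live words from the region start to last_offset
  uint16_t last_offset = 0;  // word offset of the last queried object
  bool     valid       = false;
};

// `cache` must hold one entry per region for this GC thread.
HeapWord calc_new_location_region_iq(const CompactionSummary& sum,
                                     const MarkBitmap& live, HeapWord obj,
                                     std::vector<RegionCacheEntry>& cache) {
  size_t r = sum.region_index(obj);
  HeapWord region_start = sum.region_start(r);
  RegionCacheEntry& e = cache[r];
  HeapWord cached_end = region_start + e.last_offset * sizeof(HeapWord);

  HeapWord scan_from = region_start;
  size_t counted = 0;
  if (e.valid && cached_end <= obj) {
    // Any region's cached progress can be reused, so both the local and the
    // random query pattern benefit.
    scan_from = cached_end;
    counted = e.live_words;
  }
  size_t preceding = counted + live_words_in_range(live, scan_from, obj);
  e = {static_cast<uint16_t>(preceding),
       static_cast<uint16_t>((obj - region_start) / sizeof(HeapWord)),
       true};
  return sum.regions[r].destination + preceding * sizeof(HeapWord);
}
```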
21
Experimental environments
Parameter              | Intel Xeon CPU E5-2620                                     | Intel Xeon Phi Coprocessor 5110P
Chips                  | 1                                                          | 1
Core type              | Out-of-order                                               | In-order
Physical cores         | 6                                                          | 60
Frequency              | 2.00 GHz                                                   | 1.053 GHz
Data caches            | 32 KB L1d, 32 KB L1i, 256 KB L2 per core; 15 MB L3 shared  | 32 KB L1, 512 KB L2 per core
Memory capacity        | 32 GB                                                      | 7697 MB
Memory technology      | DDR3                                                       | GDDR5
Memory access latency  | 140 cycles                                                 | 340 cycles
22
Experimental environments (cont.)
JOlden + GCBench + DaCapo + SPECjvm
Spark + Giraph (X.v & C.c refer to Xml.validation & Compiler.compiler)
OpenJDK 7u + HotSpot JVM
23
Speedup of Full GC Thru. on CPU
(Figure: comparison of the 3 query schemes and OpenJDK 8 with 1 & 6 GC threads; peak speedups of 1.99x and 1.94x)
24
Improvement of App. Thru. on CPU
(Figure: 19.3% improvement with 6 GC threads using region-based IQ)
25
Speedup on Xeon Phi
(Figure: speedup of full GC & app. thru. with 1 & 20 GC threads using region-based IQ; peak full-GC speedups of 2.22x and 2.08x, and an 11.1% application-throughput improvement)
26
Reduction in Pause Time
(Figure: normalized elapsed time of full GC & total pause, lower is better; reductions of 31.2% and 34.9%)
27
Speedup for Big-data on CPU
Speedup of full GC & app. thru. using region-based IQ with varying input and heap sizes
28
Conclusions
A thorough profiling-based analysis of Parallel Scavenge in a production JVM (HotSpot)
An incremental query model and three different schemes
Integrated into the mainstream OpenJDK
29
Thanks! Questions?
30
Backups
31
Port of Region-based IQ to OpenJDK 8
Speedup of full GC thru. of region-based IQ on JDK 8
32
Evaluation on Clusters
Orthogonal to distributed execution
A small-scale evaluation on a 5-node cluster, each node with two 10-core Intel Xeon E v3 processors and 64 GB DRAM
Ran Spark PageRank with a 100-million-edge input and a 10 GB heap on each node
Recorded the accumulated full GC time across all nodes and the elapsed application time on the master
63.8% and 7.3% improvement for full GC and application throughput, respectively
Smaller speedup because network communication becomes a more dominant factor during distributed execution