1
Performance Analysis and Optimization of Full GC in Memory-hungry Environments
Yang Yu, Tianyang Lei, Weihua Zhang, Haibo Chen, Binyu Zang
Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University, China
Fudan University, China
VEE 2016
2
Big-data Ecosystem
3
JVM-based languages
4
Memory-hungry environments
Memory bloat phenomenon in large-scale Java applications [ISMM '13]
Limited per-application memory in a shared-cluster design inside companies like Google [EuroSys '13]
Limited per-core memory in many-core architectures (e.g., Intel Xeon Phi)
5
Effects of Garbage Collection
GC suffers severe strain
Accumulated stragglers [HOTOS '15]
Amplified tail latency [Commun. ACM]
Where exactly is the bottleneck of GC in such memory-hungry environments?
6
Parallel Scavenge in a Production JVM – HotSpot
Default garbage collector in OpenJDK 7 & 8
Stop-the-world, throughput-oriented
Heap space segregated into multiple areas: young generation, old generation, permanent generation
Young GC to collect the young gen
Full GC to collect all, mainly for the old gen
7
Profiling of PS GC
GC profiling of data-intensive Java programs from JOlden
Set heap size close to workload size to keep memory hungry
8
Full GC of Parallel Scavenge
A variant of the Mark-Compact algorithm: slide live objects towards the starting side
Two bitmaps mapping the heap
Heap initially segregated into multiple regions
Three phases: marking, summary & compacting
(Figure: two bitmaps mapping the heap; a simplified sketch of these structures follows)
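To make the structure concrete, here is a minimal C++ sketch of the data the compacting full GC works with. All type and field names (MarkBitmap, RegionData, CompactionSummary) are illustrative simplifications, not HotSpot's actual classes or layouts.

```cpp
// Minimal illustrative sketch, not HotSpot source: simplified stand-ins for
// the mark bitmap and per-region summary data that full GC maintains.
#include <cstddef>
#include <cstdint>
#include <vector>

using HeapWord = uintptr_t;            // a word-aligned heap address

// One bit per heap word. HotSpot keeps two bitmaps (object-begin and
// object-end bits); a single per-word liveness bitmap stands in for both
// here to keep the sketch short.
struct MarkBitmap {
  std::vector<uint64_t> bits;
  HeapWord heap_base = 0;
  bool is_marked(HeapWord addr) const {
    size_t word = (addr - heap_base) / sizeof(HeapWord);
    return (bits[word / 64] >> (word % 64)) & 1u;
  }
};

// The heap is split into fixed-size regions. The summary phase records,
// for every region, where its live data will be slid to and how much of
// it there is.
struct RegionData {
  HeapWord destination = 0;  // address the region's live data moves to
  size_t   live_words  = 0;  // total live words inside the region
};

struct CompactionSummary {
  std::vector<RegionData> regions;
  size_t   region_size_words = 0;
  HeapWord heap_base = 0;

  size_t region_index(HeapWord addr) const {
    return (addr - heap_base) / (region_size_words * sizeof(HeapWord));
  }
  HeapWord region_start(size_t idx) const {
    return heap_base + idx * region_size_words * sizeof(HeapWord);
  }
};
```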
9
Decomposition of Full GC
10
Update Refs Using Bitmaps
(Figure: updating process for a referenced live object O, from its source region to its destination)
11
Reference Updating Algorithm
Calculate the new location that a reference points to (a sketch of this baseline calculation follows)
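The calculation can be sketched as follows, continuing the simplified types above (names are illustrative, not HotSpot's): the new address is the destination of the object's region plus the amount of live data preceding the object inside that region, obtained by scanning the mark bitmap from the region start. This repeated scan is the work the later slides identify as the key issue.

```cpp
// Continues the simplified sketch above (illustrative names, not HotSpot code).

// Count live words in [beg, end) by walking the bitmap word by word.
// The real collector walks object extents; a plain per-word scan is shown
// here only to make the cost of the repeated search visible.
size_t live_words_in_range(const MarkBitmap& live, HeapWord beg, HeapWord end) {
  size_t n = 0;
  for (HeapWord a = beg; a < end; a += sizeof(HeapWord)) {
    if (live.is_marked(a)) n++;
  }
  return n;
}

// Baseline query: new location of the object a reference points to.
HeapWord calc_new_location(const CompactionSummary& sum,
                           const MarkBitmap& live, HeapWord obj) {
  size_t r = sum.region_index(obj);
  // Every query re-scans the bitmap from the region start up to the object.
  size_t preceding = live_words_in_range(live, sum.region_start(r), obj);
  return sum.regions[r].destination + preceding * sizeof(HeapWord);
}
```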
14
Decomposition of Full GC (cont.)
We found the bottleneck: the new-location queries performed during reference updating!
15
Solution: Incremental Query
Key issue: repeated searching range when two sequentially searched objects reside in the same region
Basic idea: reuse the result of the last query (see the sketch below)
(Figure: the last query in region R searched from beg_addr to last_end_addr; if the current query targets the same region, the previous result can be reused)
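A minimal sketch of that reuse, still using the illustrative types above (this roughly corresponds to the optimistic scheme on a later slide, which keeps one cached result per GC thread): if the current object falls in the same region as the last query and lies after it, only the gap between the two is scanned and the cached count is reused.

```cpp
// Continues the simplified sketch above (illustrative names, not HotSpot code).

// One cached query result; in the straightforward (optimistic) scheme each
// GC thread keeps a single such record shared by all regions.
struct LastQuery {
  bool     valid = false;
  size_t   region = 0;       // region index of the last queried object
  HeapWord last_end = 0;     // address up to which live words were counted
  size_t   live_so_far = 0;  // live words from the region start to last_end
};

HeapWord calc_new_location_iq(const CompactionSummary& sum,
                              const MarkBitmap& live, HeapWord obj,
                              LastQuery& cache) {
  size_t r = sum.region_index(obj);
  HeapWord scan_from = sum.region_start(r);
  size_t counted = 0;
  if (cache.valid && cache.region == r && cache.last_end <= obj) {
    // Same region and the new object lies after the last query:
    // reuse the cached count and scan only the remaining gap.
    scan_from = cache.last_end;
    counted = cache.live_so_far;
  }
  size_t preceding = counted + live_words_in_range(live, scan_from, obj);
  cache = {true, r, obj, preceding};
  return sum.regions[r].destination + preceding * sizeof(HeapWord);
}
```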
16
Caching Types
(Figure: SPECjbb2015, 1 GB workload, 10 GB heap)
17
Query Patterns
Local pattern: sequentially referenced objects tend to lie in the same region, so the results of the last queries can easily be reused
Random pattern: sequentially referenced objects lie in random regions, so the last results cannot be reused directly
Most applications are a mix of the two query patterns, differing only in their proportions
18
Optimistic IQ (1/3)
A straightforward implementation that complies with the basic idea: each GC thread maintains one global last-query result for all the regions (as in the sketch above)
Pros: little overhead in both memory utilization and calculation
Cons: relies heavily on the local pattern to be effective
19
Sort-based IQ (2/3)
Dynamically reorders refs with a lazy update (see the sketch below)
References are first filled into a buffer before updating; once the buffer fills up, the refs are reordered by region index
Buffer size close to the L1 cache line size
Pros: periodically gathers refs that fall in the same region
Cons: calculation overhead due to the extra sorting procedure
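A sketch of the buffering-and-sorting step under the same simplified types; the buffer size, the PendingRef layout, the address tie-break, and the flush_buffer name are assumptions made for illustration, not the paper's exact design.

```cpp
// Continues the simplified sketch above (illustrative names, not HotSpot code).
#include <algorithm>
#include <array>

constexpr size_t kBufSize = 64;  // batch size is an assumption, not the paper's value

struct PendingRef {
  HeapWord* slot;    // the field holding the reference to be updated
  HeapWord  target;  // the object the reference currently points to
};

// Flush a full buffer: reorder the pending refs by region index (with the
// target address as a tie-break) so same-region queries become consecutive
// and the incremental query above can keep reusing its cached result.
void flush_buffer(std::array<PendingRef, kBufSize>& buf, size_t n,
                  const CompactionSummary& sum, const MarkBitmap& live,
                  LastQuery& cache) {
  std::sort(buf.begin(), buf.begin() + n,
            [&](const PendingRef& a, const PendingRef& b) {
              size_t ra = sum.region_index(a.target);
              size_t rb = sum.region_index(b.target);
              return ra != rb ? ra < rb : a.target < b.target;
            });
  for (size_t i = 0; i < n; ++i) {
    *buf[i].slot = calc_new_location_iq(sum, live, buf[i].target, cache);
  }
}
```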
20
Region-based IQ (3/3)
Maintains the result of the last query for each region, per GC thread (see the sketch below)
Fits both local and random query patterns
A more aggressive slicing scheme: divide each region into multiple slices and maintain a last result for each slice
Minimized memory overhead: a 16-bit integer stores the calculated live-object size, and an offset replaces the full-length address of the last queried object
Reduced to 0.09% of the heap size with one slice per GC thread
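A sketch of the per-region cache under the same simplified types; the exact field widths and the one-entry-per-region layout (slicing omitted) are illustrative assumptions.

```cpp
// Continues the simplified sketch above (illustrative names, not HotSpot code).

// Per-thread cache with one entry per region (slices omitted for brevity).
// The compact encoding follows the slide: a 16-bit live-word count and a
// 16-bit word offset instead of a full-length address.
struct RegionCacheEntry {
  uint16_t live_words  = 0;  // live words from the region start to last_offset
  uint16_t last_offset = 0;  // word offset of the last queried object
  bool     valid       = false;
};

// `cache` must hold one entry per region for this GC thread.
HeapWord calc_new_location_region_iq(const CompactionSummary& sum,
                                     const MarkBitmap& live, HeapWord obj,
                                     std::vector<RegionCacheEntry>& cache) {
  size_t r = sum.region_index(obj);
  HeapWord region_start = sum.region_start(r);
  RegionCacheEntry& e = cache[r];
  HeapWord cached_end = region_start + e.last_offset * sizeof(HeapWord);

  HeapWord scan_from = region_start;
  size_t counted = 0;
  if (e.valid && cached_end <= obj) {
    // Any region's cached progress can be reused, so both the local and the
    // random query pattern benefit.
    scan_from = cached_end;
    counted = e.live_words;
  }
  size_t preceding = counted + live_words_in_range(live, scan_from, obj);
  e = {static_cast<uint16_t>(preceding),
       static_cast<uint16_t>((obj - region_start) / sizeof(HeapWord)),
       true};
  return sum.regions[r].destination + preceding * sizeof(HeapWord);
}
```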
21
Experimental environments
Parameter              | Intel Xeon CPU E5-2620                                     | Intel Xeon Phi Coprocessor 5110P
Chips                  | 1                                                          | 1
Core type              | Out-of-order                                               | In-order
Physical cores         | 6                                                          | 60
Frequency              | 2.00 GHz                                                   | 1.053 GHz
Data caches            | 32 KB L1d, 32 KB L1i, 256 KB L2 per core; 15 MB L3 shared  | 32 KB L1, 512 KB L2 per core
Memory capacity        | 32 GB                                                      | 7697 MB
Memory technology      | DDR3                                                       | GDDR5
Memory access latency  | 140 cycles                                                 | 340 cycles
22
Experimental environments (cont.)
JOlden + GCBench + DaCapo + SPECjvm
Spark + Giraph (X.v & C.c refer to Xml.validation & Compiler.compiler)
OpenJDK 7u + HotSpot JVM
23
Speedup of Full GC Thru. on CPU
(Figure: comparison of the 3 query schemes and OpenJDK 8 with 1 & 6 GC threads; peak speedups of 1.99x and 1.94x)
24
Improvement of App. Thru. on CPU
(Figure: 19.3% improvement with 6 GC threads using region-based IQ)
25
Speedup on Xeon Phi
(Figure: speedup of full GC & app. thru. with 1 & 20 GC threads using region-based IQ; peak full-GC speedups of 2.22x and 2.08x, and an 11.1% application-throughput improvement)
26
Reduction in Pause Time
(Figure: normalized elapsed time of full GC & total pause, lower is better; reductions of 31.2% and 34.9%)
27
Speedup for Big-data on CPU
Speedup of full GC & app. thru. using region-based IQ with varying input and heap sizes
28
Conclusions
A thorough profiling-based analysis of Parallel Scavenge in a production JVM (HotSpot)
An incremental query model and three different schemes
Integrated into the mainstream OpenJDK
29
Thanks! Questions?
30
Backups
31
Port of Region-based IQ to OpenJDK 8
Speedup of full GC thru. of region-based IQ on JDK 8
32
Evaluation on Clusters
Orthogonal to distributed execution
A small-scale evaluation on a 5-node cluster, each node with two 10-core Intel Xeon E v3 processors and 64 GB DRAM
Ran Spark PageRank with a 100-million-edge input and a 10 GB heap on each node
Recorded the accumulated full GC time across all nodes and the elapsed application time on the master
63.8% and 7.3% improvement for full GC and application throughput, respectively
Smaller speedup because network communication becomes a more dominant factor during distributed execution