Performance Analysis and Optimization of Full GC in Memory-hungry Environments
Yang Yu, Tianyang Lei, Weihua Zhang, Haibo Chen, Binyu Zang
Institute of Parallel and Distributed Systems (IPADS), Shanghai Jiao Tong University, China
Fudan University, China
VEE 2016
Big-data Ecosystem
JVM-based languages
Memory-hungry environments
- Memory bloat phenomenon in large-scale Java applications [ISMM '13]
- Limited per-application memory in a shared-cluster design inside companies like Google [EuroSys '13]
- Limited per-core memory in many-core architectures (e.g., Intel Xeon Phi)
Effects of Garbage Collection
- GC suffers severe strain: accumulated stragglers [HotOS '15], amplified tail latency [Commun. ACM]
- Where exactly is the bottleneck of GC in such memory-hungry environments?
Parallel Scavenge in a Production JVM – HotSpot
- Default garbage collector in OpenJDK 7 & 8
- Stop-the-world, throughput-oriented
- Heap segregated into multiple areas: young generation, old generation, permanent generation
- Young GC collects the young generation; Full GC collects the whole heap, mainly the old generation
Profiling of PS GC
- GC profiling of data-intensive Java programs from JOlden
- Heap size set close to the workload size to keep memory hungry (a sample run is sketched below)
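As a concrete illustration of such a setup, a run might look like the following. The HotSpot flags are standard OpenJDK 7/8 options; the benchmark jar name and the 1 GB heap sizing are placeholders, not the paper's exact configuration.

```sh
# Select Parallel Scavenge with parallel full GC, pin the heap close to the
# workload size, and emit a per-collection log for the phase decomposition.
# (jolden-workload.jar and the 1 GB sizes are illustrative placeholders.)
java -XX:+UseParallelGC -XX:+UseParallelOldGC \
     -XX:ParallelGCThreads=6 \
     -Xms1g -Xmx1g \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -jar jolden-workload.jar
```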
Full GC of Parallel Scavenge
- A variant of the mark-compact algorithm: slide live objects toward the starting side of the heap
- Two bitmaps mapping the heap; the heap is initially segregated into multiple regions
- Three phases: marking, summary, and compacting (sketched below)
[Figure: two bitmaps mapping the heap]
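To make the three phases concrete, here is a minimal toy model in C++ (the language of HotSpot). A single per-word liveness bitmap stands in for the two object-boundary bitmaps of the real collector, and "objects" are just heap words, so this only illustrates the structure of the algorithm, not the actual implementation.

```cpp
// Toy model of PS full GC's three phases over a heap of words split into regions.
#include <cstddef>
#include <cstdint>
#include <vector>

struct ToyHeap {
    static constexpr size_t kRegionWords = 1024;   // region granularity
    std::vector<uint64_t> words;                   // heap contents
    std::vector<bool>     mark_bitmap;             // marking phase output: 1 bit per live word
    std::vector<size_t>   region_dest;             // summary phase output: new start of each region's live data
};

// Phase 1 (marking) traces the object graph from the roots and sets bits for
// reachable objects; the traversal itself is elided here.

// Phase 2 (summary): a prefix sum of live words tells every region where its
// live data will land once everything slides toward the start of the heap.
void summary(ToyHeap& h) {
    size_t n_regions = h.words.size() / ToyHeap::kRegionWords;
    h.region_dest.assign(n_regions, 0);
    size_t dest = 0;
    for (size_t r = 0; r < n_regions; ++r) {
        h.region_dest[r] = dest;
        for (size_t w = r * ToyHeap::kRegionWords; w < (r + 1) * ToyHeap::kRegionWords; ++w)
            dest += h.mark_bitmap[w] ? 1 : 0;
    }
}

// Phase 3 (compacting): slide live words toward the starting side; the real
// collector also rewrites every reference to point at the new locations.
void compact(ToyHeap& h) {
    size_t dest = 0;
    for (size_t w = 0; w < h.words.size(); ++w)
        if (h.mark_bitmap[w]) h.words[dest++] = h.words[w];
}
```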
Decomposition of Full GC
Update Refs Using Bitmaps
[Figure: updating process for a referenced live object O, from its source location to its destination]
Reference Updating Algorithm
- Calculate the new location that each reference points to (see the sketch below)
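A minimal sketch, under the same toy-heap assumptions as above, of how that new location is computed: the region's destination from the summary phase plus the number of live words that precede the object inside its region, obtained by scanning the mark bitmap from the region start up to the object. That per-reference bitmap scan is what the following slides identify as the bottleneck.

```cpp
// New location = destination of the object's region (summary data)
//              + live words between the region start and the object (bitmap scan).
#include <cstddef>
#include <vector>

struct CompactData {
    static constexpr size_t kRegionWords = 1024;
    std::vector<bool>   mark_bitmap;    // 1 bit per heap word
    std::vector<size_t> region_dest;    // filled in by the summary phase
};

// The costly part: a linear scan whose length grows with the amount of live
// data in front of the object within its region.
static size_t live_words_in_range(const CompactData& d, size_t beg, size_t end) {
    size_t live = 0;
    for (size_t w = beg; w < end; ++w)
        live += d.mark_bitmap[w] ? 1 : 0;
    return live;
}

// Translate the old word index of a referenced object to its post-compaction index.
size_t calc_new_location(const CompactData& d, size_t old_word) {
    size_t region     = old_word / CompactData::kRegionWords;
    size_t region_beg = region * CompactData::kRegionWords;
    return d.region_dest[region] + live_words_in_range(d, region_beg, old_word);
}
```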
Decomposition of Full GC (cont.)
We found the bottleneck: the reference-updating step described on the previous slides.
Solution: Incremental Query
- Key issue: the searching range is repeated when two sequentially searched objects reside in the same region
- Basic idea: reuse the result of the last query (sketched below)
[Figure: last query in region R vs. current queries M, N, Q; when a current query falls in the same region, the last searching range can be reused]
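A sketch of the reuse step, continuing the toy types above: each query remembers where its bitmap scan stopped and how many live words it had seen; if the next query lands in the same region at a higher address, the scan resumes from that point instead of restarting at the region start. Names are illustrative.

```cpp
// Incremental query: resume the previous bitmap scan when the current query
// falls in the same region and does not lie behind the cached end point.
#include <cstddef>
#include <vector>

struct CompactData {
    static constexpr size_t kRegionWords = 1024;
    std::vector<bool>   mark_bitmap;
    std::vector<size_t> region_dest;
};

struct LastQuery {                   // per-GC-thread cache of the previous query
    size_t region    = ~size_t(0);   // region of the previous query (invalid at start)
    size_t end_word  = 0;            // where the previous scan stopped
    size_t live_seen = 0;            // live words counted from the region start to end_word
};

size_t calc_new_location_incremental(const CompactData& d, LastQuery& last, size_t old_word) {
    size_t region     = old_word / CompactData::kRegionWords;
    size_t region_beg = region * CompactData::kRegionWords;

    size_t scan_beg = region_beg;    // default: full scan from the region start
    size_t live     = 0;
    if (region == last.region && old_word >= last.end_word) {
        scan_beg = last.end_word;    // same region, forward query: reuse the last result
        live     = last.live_seen;
    }
    for (size_t w = scan_beg; w < old_word; ++w)
        live += d.mark_bitmap[w] ? 1 : 0;

    last.region    = region;         // remember this query for the next one
    last.end_word  = old_word;
    last.live_seen = live;
    return d.region_dest[region] + live;
}
```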
Caching Types (SPECjbb2015, 1 GB workload, 10 GB heap)
Query Patterns
- Local pattern: sequentially referenced objects tend to lie in the same region, so results of the last queries can easily be reused
- Random pattern: sequentially referenced objects lie in random regions, so last results cannot be reused directly
- Most applications mix the two query patterns in different proportions (see the sketch below)
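Purely as an illustration (not from the paper), one way to quantify those "respective proportions" is the fraction of consecutive queries that land in the same region:

```cpp
// Values near 1 indicate the local pattern, values near 0 the random pattern;
// real workloads sit somewhere in between.
#include <cstddef>
#include <vector>

static constexpr size_t kRegionWords = 1024;   // toy region size

double same_region_fraction(const std::vector<size_t>& queried_words) {
    if (queried_words.size() < 2) return 0.0;
    size_t hits = 0;
    for (size_t i = 1; i < queried_words.size(); ++i)
        if (queried_words[i] / kRegionWords == queried_words[i - 1] / kRegionWords)
            ++hits;
    return static_cast<double>(hits) / static_cast<double>(queried_words.size() - 1);
}
```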
Optimistic IQ (1/3)
- A straightforward implementation that complies with the basic idea: each GC thread maintains one global last-query result for all the regions
- Pros: little overhead in both memory utilization and computation
- Cons: relies heavily on the local pattern to be effective
Sort-based IQ (2/3)
- Dynamically reorder refs with a lazy update (sketched below): references are first filled into a buffer before updating; once the buffer fills up, refs are reordered by region index; the buffer size is close to the L1 cache line size
- Pros: periodically gathers refs in the same region
- Cons: computation overhead from the extra sorting step
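A sketch of the buffering idea on the same toy model; `update_one` stands for the incremental query from the earlier sketch, and the 8-slot buffer is an illustrative stand-in for "close to the L1 cache line size".

```cpp
// Sort-based IQ: batch pending queries into a small buffer, reorder them by
// region index so that same-region queries become adjacent, then drain the
// buffer through the incremental query.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

struct SortBasedIQ {
    static constexpr size_t kRegionWords = 1024;
    static constexpr size_t kBufferSlots = 8;        // 8 x 8-byte entries ~= one 64-byte cache line

    std::function<size_t(size_t)> update_one;        // the incremental query sketched earlier
    std::vector<size_t> buffer;                      // old word indices of buffered references

    void push(size_t old_word) {
        buffer.push_back(old_word);
        if (buffer.size() == kBufferSlots) flush();  // lazy update: only on a full buffer
    }

    void flush() {
        // Reorder by region index so consecutive queries hit the same cached region.
        std::sort(buffer.begin(), buffer.end(), [](size_t a, size_t b) {
            return a / kRegionWords < b / kRegionWords;
        });
        for (size_t old_word : buffer)
            (void)update_one(old_word);              // real code would also rewrite the reference slot
        buffer.clear();
    }
};
```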
Region-based IQ (3/3)
- Maintain the last-query result for each region per GC thread: fits both local and random query patterns (sketched below)
- A more aggressive slicing scheme: divide each region into multiple slices, maintaining the last result for each slice
- Minimize memory overhead: a 16-bit integer stores the calculated size of live objects, and an offset replaces the full-length address of the last queried object; reduced to 0.09% of the heap size with one slice per GC thread
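A sketch of the region-based scheme on the same toy model: one compact cache entry per region (the slicing scheme described above would further subdivide each region, one entry per slice), with a 16-bit offset and a 16-bit live-word count in place of full addresses. Names and sizes are illustrative.

```cpp
// Region-based IQ: a per-region (per-slice) cache entry lets any query reuse
// the most recent scan in its own region, regardless of where the previous
// query landed. Entries are compressed to two 16-bit fields.
#include <cstddef>
#include <cstdint>
#include <vector>

struct SliceCache {
    uint16_t end_offset = 0;   // word offset within the region where the last scan stopped
    uint16_t live_seen  = 0;   // live words counted from the region start up to end_offset
};

struct RegionBasedIQ {
    static constexpr size_t kRegionWords = 1024;   // must fit in the 16-bit fields above

    std::vector<bool>       mark_bitmap;           // 1 bit per heap word
    std::vector<size_t>     region_dest;           // summary phase output
    std::vector<SliceCache> cache;                 // one entry per region, per GC thread

    explicit RegionBasedIQ(size_t n_regions)
        : mark_bitmap(n_regions * kRegionWords, false),
          region_dest(n_regions, 0),
          cache(n_regions) {}

    size_t calc_new_location(size_t old_word) {
        size_t region     = old_word / kRegionWords;
        size_t region_beg = region * kRegionWords;
        SliceCache& c     = cache[region];

        size_t scan_beg = region_beg;              // default: scan from the region start
        size_t live     = 0;
        if (old_word >= region_beg + c.end_offset) {
            scan_beg = region_beg + c.end_offset;  // reuse this region's cached scan
            live     = c.live_seen;
        }
        for (size_t w = scan_beg; w < old_word; ++w)
            live += mark_bitmap[w] ? 1 : 0;

        c.end_offset = static_cast<uint16_t>(old_word - region_beg);
        c.live_seen  = static_cast<uint16_t>(live);
        return region_dest[region] + live;
    }
};
```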
Experimental environments
Parameter              | Intel Xeon CPU E5-2620                                     | Intel Xeon Phi Coprocessor 5110P
Chips                  | 1                                                          | 1
Core type              | Out-of-order                                               | In-order
Physical cores         | 6                                                          | 60
Frequency              | 2.00 GHz                                                   | 1052.63 MHz
Data caches            | 32 KB L1d, 32 KB L1i, 256 KB L2 per core; 15 MB L3 shared  | 32 KB L1, 512 KB L2 per core
Memory capacity        | 32 GB                                                      | 7697 MB
Memory technology      | DDR3                                                       | GDDR5
Memory access latency  | 140 cycles                                                 | 340 cycles
Experimental environments (cont.)
- Benchmarks: JOlden, GCBench, DaCapo, SPECjvm2008, Spark, Giraph (X.v & C.c refer to Xml.validation & Compiler.compiler)
- OpenJDK 7u with the HotSpot JVM
Speedup of Full GC Throughput on CPU
[Chart: comparison of the 3 query schemes and OpenJDK 8 with 1 & 6 GC threads; headline speedups of 1.99x and 1.94x]
Improvement of Application Throughput on CPU
[Chart: 19.3% improvement with 6 GC threads using region-based IQ]
Speedup on Xeon Phi
[Chart: speedup of full GC & application throughput with 1 & 20 GC threads using region-based IQ; highlighted figures 2.22x, 2.08x, and 11.1%]
Reduction in Pause Time
[Chart: normalized elapsed time of full GC & total pause (lower is better); reductions of 31.2% and 34.9%]
Speedup for Big Data on CPU
[Chart: speedup of full GC & application throughput using region-based IQ with varying input and heap sizes]
Conclusions
- A thorough profiling-based analysis of Parallel Scavenge in a production JVM – HotSpot
- An incremental query model and three different schemes
- Integrated into the OpenJDK mainline (JDK-8146987)
Thanks! Questions?
Backups
Port of Region-based IQ to OpenJDK 8
[Chart: speedup of full GC throughput with region-based IQ on JDK 8]
Evaluation on Clusters
- Orthogonal to distributed execution
- A small-scale evaluation on a 5-node cluster, each node with two 10-core Intel Xeon E5-2650 v3 processors and 64 GB DRAM
- Ran Spark PageRank with a 100-million-edge input and a 10 GB heap on each node
- Recorded accumulated full GC time across all nodes and elapsed application time on the master
- 63.8% and 7.3% improvement for full GC and application throughput, respectively
- Smaller speedup because network communication becomes a more dominant factor during distributed execution