Garbage Collection for Large Scale Multiprocessors
(Funded by ANR projects: Prose and ConcoRDanT)
Lokesh Gidra, Gaël Thomas, Julien Sopena, Marc Shapiro
Regal-LIP6/INRIA
Introduction
Why?
  – Heavy use of Managed Runtime Environments
      Application servers
      Scientific applications
      Example: JBoss, Sunflow, etc.
  – Hardware is increasingly multi-resourced.
  – GC performance is critical.
  – Existing GCs were developed for SMPs.
What?
  – Assess GC scalability: empirical results.
  – Possible factors affecting GC scalability.
  – Our approach to fixing them.
Contemporary Architecture
[Figure: two nodes (Node 0 and Node 1), each with cores C0 to C5, L2 and L3 caches, and its own DRAM.]
Our machine has 8 such nodes with 6 cores each.
Non-Uniform Memory Access (NUMA): remote access >> local access
GC Scalability (Lusearch)
[Figure: HotSpot JVM's garbage collectors. Axis labels: Pause Time, GC Threads, Application Time, Application Threads.]
Pause time increases with the number of GC threads: negative scalability!
Trivial Bottleneck
Scalable synchronization primitives are vital.
GC task queue uses a monitor
  – Unnecessarily blocks GC threads.
Replaced with a lock-free version.
No barrier for GC threads after GC completion.
Trivial but very important: up to 80% improvement.
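The change from a monitor to a lock-free task queue can be illustrated with a minimal Java sketch. It is not HotSpot's code: the class `LockFreeTaskQueueSketch` and its methods are hypothetical, and the standard `ConcurrentLinkedQueue` stands in for whatever lock-free structure the actual fix uses. The point is only that a worker asking for a task never blocks on a shared lock; it either gets a task or sees an empty queue.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a GC task queue shared by all GC worker threads.
final class LockFreeTaskQueueSketch {
    // ConcurrentLinkedQueue is a lock-free queue from the JDK.
    private final ConcurrentLinkedQueue<Runnable> tasks = new ConcurrentLinkedQueue<>();

    void add(Runnable task) {
        tasks.offer(task);
    }

    // Non-blocking: returns null when the queue is empty instead of waiting on a monitor.
    Runnable next() {
        return tasks.poll();
    }

    // Runs taskCount trivial tasks on `workers` threads and returns how many completed.
    static int runAll(int workers, int taskCount) {
        LockFreeTaskQueueSketch queue = new LockFreeTaskQueueSketch();
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            queue.add(done::incrementAndGet);
        }
        Thread[] threads = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            threads[i] = new Thread(() -> {
                Runnable task;
                while ((task = queue.next()) != null) {
                    task.run();
                }
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return done.get();
    }
}
```

Calling `LockFreeTaskQueueSketch.runAll(4, 1000)` executes every task exactly once, with no worker ever parked on a queue lock.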
Main Bottleneck
Remote access and … remote access!
7 out of 8 accesses are remote
  – When scanning an object (87.7% remote)
  – When copying an object (82.7% remote)
  – When stealing for load balancing (2-4 bus ops/steal)
Our Approach: Big Picture
Improve GC locality
  – Local scan
  – Local copy
  – Local stealing
Tradeoff:
  – Locality vs. load balance
Fix the young generation of ParallelScavenge.
Avoid Remote Access
[Figure: From-space and To-space, each split across Node 0 and Node 1, holding objects a to f. GC threads GC0 and GC1. A reference queue from node 0 to node 1 ("Ref. Q from 0 to 1").]
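One way to read this slide is that a GC thread never scans an object living on another node; it instead pushes the reference onto a queue destined for that node, where a local GC thread picks it up. The Java sketch below simulates that reading single-threaded. Everything in it (`NodeLocalScanSketch`, `Obj`, the `queue.get(from).get(to)` layout) is an assumption made for illustration, not the actual ParallelScavenge data structures.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class NodeLocalScanSketch {
    /** Hypothetical heap object: resides on one NUMA node and references other objects. */
    static final class Obj {
        final String name;
        final int node;
        final List<Obj> refs = new ArrayList<>();
        Obj(String name, int node) { this.name = name; this.node = node; }
    }

    /** Returns, for each node, the names of the objects scanned by that node's GC thread. */
    static List<List<String>> scan(int nodes, List<Obj> roots) {
        // queue.get(from).get(to): references found on node `from` to objects on node `to`.
        List<List<Deque<Obj>>> queue = new ArrayList<>();
        List<List<String>> scanned = new ArrayList<>();
        for (int from = 0; from < nodes; from++) {
            List<Deque<Obj>> row = new ArrayList<>();
            for (int to = 0; to < nodes; to++) {
                row.add(new ArrayDeque<>());
            }
            queue.add(row);
            scanned.add(new ArrayList<>());
        }
        for (Obj root : roots) {
            queue.get(root.node).get(root.node).add(root);
        }

        Set<Obj> visited = new HashSet<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            for (int me = 0; me < nodes; me++) {           // one simulated GC thread per node
                for (int from = 0; from < nodes; from++) { // drain every queue addressed to `me`
                    Obj o;
                    while ((o = queue.get(from).get(me).poll()) != null) {
                        progress = true;
                        if (!visited.add(o)) {
                            continue;                      // already scanned
                        }
                        scanned.get(me).add(o.name);       // o.node == me, so this is a local scan
                        for (Obj child : o.refs) {
                            // Do not touch the child here: hand it to its home node instead.
                            queue.get(me).get(child.node).add(child);
                        }
                    }
                }
            }
        }
        return scanned;
    }
}
```

With `a` and `c` on node 0, `b` and `d` on node 1, and references a→b, a→c, c→d, the node 0 thread scans only `a` and `c` while the node 1 thread scans only `b` and `d`. The price is visible in the structure: a node can only work on what has been queued for it, which is the locality versus load-balance tradeoff.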
Heap Partitioning
[Figure: baseline design: one space of n MB. NUMA-aware space: two chunks of n/2 MB each; chunk 0 is only ¼ full, chunk 1 is full.]
Collect when full.
Problem: collections happen more often, since one is triggered as soon as even 1 chunk is full.
Heap Partitioning: Our Approach
[Figure: Chunk 0 and Chunk 1, = n MB.]
Collect when total = n MB.
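The effect of the two trigger policies can be checked with a small simulation. The sketch below is an illustration under assumptions, not the collector's logic: `HeapTriggerSketch` and its parameters are hypothetical, allocation is modelled as fixed per-node amounts per round, and under the "total" policy a chunk is assumed free to grow past n/2 MB as long as the sum stays within n MB.

```java
import java.util.Arrays;

final class HeapTriggerSketch {
    /**
     * Counts the collections triggered while each node i allocates perRoundMb[i] MB per round.
     * collectOnTotal = false: collect as soon as one chunk (heapMb / nodes) would overflow.
     * collectOnTotal = true:  collect only when the total would exceed heapMb.
     */
    static int collections(int heapMb, int[] perRoundMb, int rounds, boolean collectOnTotal) {
        int nodes = perRoundMb.length;
        int chunkMb = heapMb / nodes;
        int[] used = new int[nodes];
        int total = 0;
        int gcs = 0;
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < nodes; i++) {
                int want = perRoundMb[i];
                boolean full = collectOnTotal
                        ? total + want > heapMb
                        : used[i] + want > chunkMb;
                if (full) {
                    gcs++;                 // simplification: a collection empties the whole space
                    Arrays.fill(used, 0);
                    total = 0;
                }
                used[i] += want;
                total += want;
            }
        }
        return gcs;
    }
}
```

For a 64 MB space where node 1 allocates four times as fast as node 0 (`new int[] {1, 4}`, matching the "¼ full vs. full" picture) over 128 rounds, the per-chunk policy triggers 15 collections and the total-based policy triggers 10.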
Load Balancing
NUMA-aware work stealing
  – A thread only steals from threads on its own node.
What about inter-node imbalance?
  – Apps with a master-slave design cause this
      Example: h2 database
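The stealing rule can be sketched in a few lines of Java. The names (`NumaStealSketch`, `Worker`, `steal`) are hypothetical and the victim selection is a plain linear scan, chosen for brevity rather than taken from the collector. The only point illustrated is the restriction of victims to the thief's own node.

```java
import java.util.concurrent.ConcurrentLinkedDeque;

final class NumaStealSketch {
    /** Hypothetical GC worker pinned to one NUMA node, with its own task deque. */
    static final class Worker {
        final int node;
        final ConcurrentLinkedDeque<Integer> tasks = new ConcurrentLinkedDeque<>();
        Worker(int node) { this.node = node; }
    }

    /**
     * Tries to steal one task for `thief`, looking only at workers on the same node.
     * Returns null when no local victim has work, even if other nodes are busy.
     */
    static Integer steal(Worker thief, Worker[] all) {
        for (Worker victim : all) {
            if (victim == thief || victim.node != thief.node) {
                continue;                               // skip self and remote workers
            }
            Integer task = victim.tasks.pollLast();     // take from the end opposite the owner's
            if (task != null) {
                return task;
            }
        }
        return null;
    }
}
```

If all pending work sits on one node, as with a master thread whose stack roots most of the heap, workers on the other nodes get `null` from `steal` and stay idle. That is the inter-node imbalance the slide asks about.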
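[Figure: From-space and To-space, each split across Node 0 and Node 1, holding objects a, b, c, d. GC threads GC0 and GC1. A reference queue from node 0 to node 1 ("Ref Q from 0 to 1"). The master's stack and some slave's stack.]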
Conclusion and Future Work
Remote access hinders the scalability of GC.
Tradeoff: locality vs. load balance
  – Inter-node imbalance acts as a hurdle.
Using all the cores is sub-optimal
  – Hits the memory wall.
Adaptive resizing of the NUMA-aware generation costs more!
Up to 65% on scalable benchmarks of DaCapo.