1
Garbage Collection for Large Scale Multiprocessors (funded by the ANR projects Prose and ConcoRDanT). Lokesh Gidra, Gaël Thomas, Julien Sopena, Marc Shapiro (Regal-LIP6/INRIA)
2
Introduction
Why?
– Heavy use of managed runtime environments: application servers, scientific applications (e.g., JBoss, Sunflow).
– Hardware offers more and more cores and memory nodes.
– GC performance is critical.
– Existing GCs were developed for SMPs.
What?
– Assess GC scalability: empirical results.
– Identify the factors affecting GC scalability.
– Our approach to fixing them.
3
Contemporary Architecture
[Diagram: two NUMA nodes (Node 0 and Node 1), each with cores C0..C5, shared L2/L3 caches, and local DRAM; access costs of 15, 40, 125, and 315 are annotated on the figure.]
Our machine has 8 such nodes with 6 cores each.
Non-Uniform Memory Access (NUMA): a remote access costs far more than a local access.
4
GC Scalability (Lusearch)
[Plot: pause time as a function of the number of GC threads, and application time as a function of the number of application threads, for HotSpot JVM's garbage collectors.]
Pause time increases with the number of GC threads: negative scalability!
5
Trivial Bottleneck
Scalable synchronization primitives are vital.
– The GC task queue uses a monitor, which unnecessarily blocks GC threads; we replaced it with a lock-free version (sketched below).
– No barrier for GC threads after GC completion.
Trivial but very important: up to 80% improvement.
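To make the idea concrete, here is a minimal sketch of a compare-and-swap based (Treiber) task stack that could replace a monitor-protected queue. This is not HotSpot's actual GCTaskQueue: the types GCTask and LockFreeTaskStack are invented for the example, and a production version would additionally have to deal with the ABA problem.

```cpp
// Minimal sketch of a CAS-based (Treiber) task stack, standing in for a
// monitor-protected GC task queue. Illustrative only: not HotSpot code.
#include <atomic>

struct GCTask {
    GCTask* next = nullptr;   // intrusive link used by the stack
    // ... task payload, e.g. a heap region to scan ...
};

class LockFreeTaskStack {
    std::atomic<GCTask*> head_{nullptr};
public:
    void push(GCTask* task) {
        GCTask* old = head_.load(std::memory_order_relaxed);
        do {
            task->next = old;                       // link on top of current head
        } while (!head_.compare_exchange_weak(old, task,
                                              std::memory_order_release,
                                              std::memory_order_relaxed));
    }

    GCTask* pop() {
        GCTask* old = head_.load(std::memory_order_acquire);
        // Assumes tasks are only reclaimed after the GC cycle, which sidesteps
        // the ABA/use-after-free issues of a general-purpose Treiber stack.
        while (old && !head_.compare_exchange_weak(old, old->next,
                                                   std::memory_order_acquire,
                                                   std::memory_order_relaxed)) {
            // 'old' was refreshed by the failed CAS; retry.
        }
        return old;                                 // nullptr when empty
    }
};

int main() {
    LockFreeTaskStack tasks;
    GCTask t1, t2;
    tasks.push(&t1);
    tasks.push(&t2);
    while (GCTask* t = tasks.pop()) { (void)t; /* process the task */ }
}
```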
6
Main Bottleneck
Remote access and ... remote access! 7 out of 8 accesses are remote:
– when scanning an object (87.7% remote),
– when copying an object (82.7% remote),
– when stealing for load balancing (2-4 bus operations per steal).
7
Our Approach: Big Picture
Improve GC locality:
– local scan,
– local copy,
– local stealing.
Trade-off: locality vs. load balance.
We apply these ideas to the young generation of ParallelScavenge.
8
Avoid Remote Access
[Diagram: the from-space and to-space are both split between Node 0 and Node 1. GC0 runs on node 0 and GC1 on node 1; each copies surviving objects (a, b, c, d, e, f) into its own node's part of the to-space. When a GC thread finds a reference to an object that belongs to the other node, it pushes the reference onto a reference queue ("Ref. Q from 0 to 1") so that the owning node performs the copy locally.]
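The sketch below shows, under assumptions of my own, one way to realize the scheme in the diagram: a per-node-pair reference queue lets the node that owns an object perform the copy, so scans and copies stay local. The names (Obj, home_node, process_reference, drain_incoming) are illustrative, and a plain std::deque stands in for the real queue.

```cpp
// Illustrative sketch (not ParallelScavenge code) of the "local copy" idea:
// a GC thread never copies a remote object itself; it forwards the reference
// to the GC thread of the node that owns the object.
#include <cstdio>
#include <deque>

constexpr int NUM_NODES = 2;

struct Obj { int home_node; };                 // stand-in for a heap object

// ref_q[from][to]: refs found by node 'from' but owned by node 'to'
std::deque<Obj*> ref_q[NUM_NODES][NUM_NODES];

void copy_locally(int node, Obj* o) {
    std::printf("node %d copies an object that lives on node %d\n",
                node, o->home_node);
}

// Called while a GC thread of 'my_node' scans the object graph.
void process_reference(int my_node, Obj* ref) {
    if (ref->home_node == my_node)
        copy_locally(my_node, ref);            // local scan + local copy
    else
        ref_q[my_node][ref->home_node].push_back(ref);  // defer to the owner
}

// Called by each GC thread to handle references forwarded by other nodes.
void drain_incoming(int my_node) {
    for (int from = 0; from < NUM_NODES; ++from)
        while (!ref_q[from][my_node].empty()) {
            copy_locally(my_node, ref_q[from][my_node].front());
            ref_q[from][my_node].pop_front();
        }
}

int main() {
    Obj a{0}, b{1};
    process_reference(0, &a);   // node-0 object: copied immediately
    process_reference(0, &b);   // node-1 object: forwarded to node 1
    drain_incoming(1);          // node 1 performs the copy locally
}
```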
9
Heap Partitioning
Baseline design: the NUMA-aware space of n MB is split into one chunk per node (here two chunks of n/2 MB each), and a collection is triggered as soon as a chunk is full.
Problem: the space is collected more often, since a collection is triggered when even one chunk is full (e.g., chunk 1 is full while chunk 0 is only ¼ full).
10
Heap Partitioning: Our Approach
Keep one chunk per node, but trigger a collection only when the total occupancy of all chunks reaches n MB, the size of the whole space.
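As a small illustration of the two trigger policies contrasted on these slides (with my own names and a fixed two-chunk example): the baseline collects as soon as any per-node chunk fills, whereas the proposed policy waits until total occupancy reaches the size of the whole space.

```cpp
// Sketch of the two collection-trigger policies, assuming one chunk per
// NUMA node. Illustrative only: not ParallelScavenge's actual code.
#include <cstddef>
#include <cstdio>

struct Chunk { size_t used, capacity; };       // sizes in MB

// Baseline NUMA-aware split: trigger as soon as ANY per-node chunk is full,
// which collects too often when allocation is unbalanced across nodes.
bool should_collect_baseline(const Chunk* chunks, int n) {
    for (int i = 0; i < n; ++i)
        if (chunks[i].used >= chunks[i].capacity) return true;
    return false;
}

// Our-approach policy: trigger only when the TOTAL used space reaches the
// total size, so one unbalanced node cannot force premature collections.
bool should_collect_total(const Chunk* chunks, int n) {
    size_t used = 0, capacity = 0;
    for (int i = 0; i < n; ++i) {
        used += chunks[i].used;
        capacity += chunks[i].capacity;
    }
    return used >= capacity;
}

int main() {
    Chunk chunks[2] = {{512, 512}, {128, 512}};   // node 0 full, node 1 at 1/4
    std::printf("baseline triggers: %d, total-occupancy triggers: %d\n",
                should_collect_baseline(chunks, 2),   // 1: premature collection
                should_collect_total(chunks, 2));     // 0: keep allocating
}
```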
11
Load Balancing
NUMA-aware work stealing: a GC thread only steals from threads on the same node (see the sketch below).
What about inter-node imbalance?
– Applications with a master-slave design cause it (e.g., the h2 database).
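A short sketch of node-local stealing, again with invented names (GCThread, steal_local); a plain std::deque stands in for the lock-free work-stealing queues a real collector would use.

```cpp
// Sketch of NUMA-aware work stealing: an idle GC thread steals only from
// threads pinned to its own node, so stealing never touches remote memory.
// Illustrative only: not ParallelScavenge code.
#include <cstdio>
#include <deque>
#include <vector>

struct Task { int id; };

struct GCThread {
    int node;                          // NUMA node this thread is pinned to
    std::deque<Task> queue;            // stand-in for a lock-free work queue
};

// Try to steal one task, restricting victims to threads on the same node.
bool steal_local(GCThread& self, std::vector<GCThread>& threads, Task& out) {
    for (GCThread& victim : threads) {
        if (&victim == &self || victim.node != self.node)
            continue;                  // never steal across nodes
        if (!victim.queue.empty()) {
            out = victim.queue.front();   // thieves take from the opposite
            victim.queue.pop_front();     // end from the owner
            return true;
        }
    }
    return false;                      // no local work left
}

int main() {
    std::vector<GCThread> threads = {{0, {}}, {0, {{1}, {2}}}, {1, {{3}}}};
    Task t;
    if (steal_local(threads[0], threads, t))
        std::printf("thread on node 0 stole task %d from a local peer\n", t.id);
}
```

Restricting victims to the same node keeps steals cheap, but it is exactly why inter-node imbalance (next slide) must be handled separately.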
12
[Diagram: in a master-slave application, the master's stack holds most of the root references (a, b, c, d) while a slave's stack holds few; GC0 on node 0 and GC1 on node 1 split these roots across the from-space and to-space of both nodes, exchanging references through the queue from node 0 to node 1.]
13
Conclusion and Future Work
Remote access hinders the scalability of the GC.
Trade-off: locality vs. load balance
– Inter-node imbalance acts as a hurdle.
Using all the cores is sub-optimal
– It hits the memory wall.
Adaptive resizing of the NUMA-aware generation costs more!
Up to 65% improvement on the scalable benchmarks of DaCapo.