1
Garbage Collection for Large Scale Multiprocessors (funded by the ANR projects Prose and ConcoRDanT). Lokesh Gidra, Gaël Thomas, Julien Sopena, Marc Shapiro (Regal-LIP6/INRIA)
2
Introduction
Why?
– Heavy use of managed runtime environments: application servers, scientific applications (e.g., JBoss, Sunflow).
– Hardware offers more and more cores and memory nodes.
– GC performance is critical.
– Existing GCs were developed for SMPs.
What?
– Assess GC scalability: empirical results.
– Identify the factors affecting GC scalability.
– Our approach to fixing them.
3
Contemporary Architecture
[Diagram: two NUMA nodes (Node 0 and Node 1), each with cores C0..C5, shared L2/L3 caches, and local DRAM; access costs of 15, 40, 125, and 315 are annotated on the figure.]
Our machine has 8 such nodes with 6 cores each.
Non-Uniform Memory Access (NUMA): a remote access costs far more than a local access.
4
GC Scalability (Lusearch)
[Plot: pause time as a function of the number of GC threads, and application time as a function of the number of application threads, for HotSpot JVM's garbage collectors.]
Pause time increases with the number of GC threads: negative scalability!
5
Trivial Bottleneck
Scalable synchronization primitives are vital.
– The GC task queue uses a monitor, which unnecessarily blocks GC threads; we replaced it with a lock-free version (sketched below).
– No barrier for GC threads after GC completion.
Trivial but very important: up to 80% improvement.
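To make the idea concrete, here is a minimal sketch of a compare-and-swap based (Treiber) task stack that could replace a monitor-protected queue. This is not HotSpot's actual GCTaskQueue: the types GCTask and LockFreeTaskStack are invented for the example, and a production version would additionally have to deal with the ABA problem.

```cpp
// Minimal sketch of a CAS-based (Treiber) task stack, standing in for a
// monitor-protected GC task queue. Illustrative only: not HotSpot code.
#include <atomic>

struct GCTask {
    GCTask* next = nullptr;   // intrusive link used by the stack
    // ... task payload, e.g. a heap region to scan ...
};

class LockFreeTaskStack {
    std::atomic<GCTask*> head_{nullptr};
public:
    void push(GCTask* task) {
        GCTask* old = head_.load(std::memory_order_relaxed);
        do {
            task->next = old;                       // link on top of current head
        } while (!head_.compare_exchange_weak(old, task,
                                              std::memory_order_release,
                                              std::memory_order_relaxed));
    }

    GCTask* pop() {
        GCTask* old = head_.load(std::memory_order_acquire);
        // Assumes tasks are only reclaimed after the GC cycle, which sidesteps
        // the ABA/use-after-free issues of a general-purpose Treiber stack.
        while (old && !head_.compare_exchange_weak(old, old->next,
                                                   std::memory_order_acquire,
                                                   std::memory_order_relaxed)) {
            // 'old' was refreshed by the failed CAS; retry.
        }
        return old;                                 // nullptr when empty
    }
};

int main() {
    LockFreeTaskStack tasks;
    GCTask t1, t2;
    tasks.push(&t1);
    tasks.push(&t2);
    while (GCTask* t = tasks.pop()) { (void)t; /* process the task */ }
}
```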
6
Main Bottleneck
Remote access and ... remote access! 7 out of 8 accesses are remote:
– when scanning an object (87.7% remote),
– when copying an object (82.7% remote),
– when stealing for load balancing (2-4 bus operations per steal).
7
Our Approach: Big Picture
Improve GC locality:
– local scan,
– local copy,
– local stealing.
Trade-off: locality vs. load balance.
We apply these ideas to the young generation of ParallelScavenge.
8
Avoid Remote Access
[Diagram: the from-space and to-space are both split between Node 0 and Node 1. GC0 runs on node 0 and GC1 on node 1; each copies surviving objects (a, b, c, d, e, f) into its own node's part of the to-space. When a GC thread finds a reference to an object that belongs to the other node, it pushes the reference onto a reference queue ("Ref. Q from 0 to 1") so that the owning node performs the copy locally.]
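The sketch below shows, under assumptions of my own, one way to realize the scheme in the diagram: a per-node-pair reference queue lets the node that owns an object perform the copy, so scans and copies stay local. The names (Obj, home_node, process_reference, drain_incoming) are illustrative, and a plain std::deque stands in for the real queue.

```cpp
// Illustrative sketch (not ParallelScavenge code) of the "local copy" idea:
// a GC thread never copies a remote object itself; it forwards the reference
// to the GC thread of the node that owns the object.
#include <cstdio>
#include <deque>

constexpr int NUM_NODES = 2;

struct Obj { int home_node; };                 // stand-in for a heap object

// ref_q[from][to]: refs found by node 'from' but owned by node 'to'
std::deque<Obj*> ref_q[NUM_NODES][NUM_NODES];

void copy_locally(int node, Obj* o) {
    std::printf("node %d copies an object that lives on node %d\n",
                node, o->home_node);
}

// Called while a GC thread of 'my_node' scans the object graph.
void process_reference(int my_node, Obj* ref) {
    if (ref->home_node == my_node)
        copy_locally(my_node, ref);            // local scan + local copy
    else
        ref_q[my_node][ref->home_node].push_back(ref);  // defer to the owner
}

// Called by each GC thread to handle references forwarded by other nodes.
void drain_incoming(int my_node) {
    for (int from = 0; from < NUM_NODES; ++from)
        while (!ref_q[from][my_node].empty()) {
            copy_locally(my_node, ref_q[from][my_node].front());
            ref_q[from][my_node].pop_front();
        }
}

int main() {
    Obj a{0}, b{1};
    process_reference(0, &a);   // node-0 object: copied immediately
    process_reference(0, &b);   // node-1 object: forwarded to node 1
    drain_incoming(1);          // node 1 performs the copy locally
}
```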
9
Heap Partitioning
Baseline design: the NUMA-aware space of n MB is split into one chunk per node (here two chunks of n/2 MB each), and a collection is triggered as soon as a chunk is full.
Problem: the space is collected more often, since a collection is triggered when even one chunk is full (e.g., chunk 1 is full while chunk 0 is only ¼ full).
10
Heap Partitioning: Our Approach
Keep one chunk per node, but trigger a collection only when the total occupancy of all chunks reaches n MB, the size of the whole space.
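As a small illustration of the two trigger policies contrasted on these slides (with my own names and a fixed two-chunk example): the baseline collects as soon as any per-node chunk fills, whereas the proposed policy waits until total occupancy reaches the size of the whole space.

```cpp
// Sketch of the two collection-trigger policies, assuming one chunk per
// NUMA node. Illustrative only: not ParallelScavenge's actual code.
#include <cstddef>
#include <cstdio>

struct Chunk { size_t used, capacity; };       // sizes in MB

// Baseline NUMA-aware split: trigger as soon as ANY per-node chunk is full,
// which collects too often when allocation is unbalanced across nodes.
bool should_collect_baseline(const Chunk* chunks, int n) {
    for (int i = 0; i < n; ++i)
        if (chunks[i].used >= chunks[i].capacity) return true;
    return false;
}

// Our-approach policy: trigger only when the TOTAL used space reaches the
// total size, so one unbalanced node cannot force premature collections.
bool should_collect_total(const Chunk* chunks, int n) {
    size_t used = 0, capacity = 0;
    for (int i = 0; i < n; ++i) {
        used += chunks[i].used;
        capacity += chunks[i].capacity;
    }
    return used >= capacity;
}

int main() {
    Chunk chunks[2] = {{512, 512}, {128, 512}};   // node 0 full, node 1 at 1/4
    std::printf("baseline triggers: %d, total-occupancy triggers: %d\n",
                should_collect_baseline(chunks, 2),   // 1: premature collection
                should_collect_total(chunks, 2));     // 0: keep allocating
}
```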
11
Load Balancing
NUMA-aware work stealing: a GC thread only steals from threads on the same node (see the sketch below).
What about inter-node imbalance?
– Applications with a master-slave design cause it (e.g., the h2 database).
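A short sketch of node-local stealing, again with invented names (GCThread, steal_local); a plain std::deque stands in for the lock-free work-stealing queues a real collector would use.

```cpp
// Sketch of NUMA-aware work stealing: an idle GC thread steals only from
// threads pinned to its own node, so stealing never touches remote memory.
// Illustrative only: not ParallelScavenge code.
#include <cstdio>
#include <deque>
#include <vector>

struct Task { int id; };

struct GCThread {
    int node;                          // NUMA node this thread is pinned to
    std::deque<Task> queue;            // stand-in for a lock-free work queue
};

// Try to steal one task, restricting victims to threads on the same node.
bool steal_local(GCThread& self, std::vector<GCThread>& threads, Task& out) {
    for (GCThread& victim : threads) {
        if (&victim == &self || victim.node != self.node)
            continue;                  // never steal across nodes
        if (!victim.queue.empty()) {
            out = victim.queue.front();   // thieves take from the opposite
            victim.queue.pop_front();     // end from the owner
            return true;
        }
    }
    return false;                      // no local work left
}

int main() {
    std::vector<GCThread> threads = {{0, {}}, {0, {{1}, {2}}}, {1, {{3}}}};
    Task t;
    if (steal_local(threads[0], threads, t))
        std::printf("thread on node 0 stole task %d from a local peer\n", t.id);
}
```

Restricting victims to the same node keeps steals cheap, but it is exactly why inter-node imbalance (next slide) must be handled separately.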
12
[Diagram: in a master-slave application, the master's stack holds most of the root references (a, b, c, d) while a slave's stack holds few; GC0 on node 0 and GC1 on node 1 split these roots across the from-space and to-space of both nodes, exchanging references through the queue from node 0 to node 1.]
13
Conclusion and Future Work
Remote access hinders the scalability of the GC.
Trade-off: locality vs. load balance
– Inter-node imbalance acts as a hurdle.
Using all the cores is sub-optimal
– It hits the memory wall.
Adaptive resizing of the NUMA-aware generation costs more!
Up to 65% improvement on the scalable benchmarks of DaCapo.