
1 Optimizing RAM-latency Dominated Applications
Yandong Mao, Cody Cutler, and Robert Morris (MIT CSAIL)

2 RAM-latency may dominate performance
- RAM-latency dominated applications:
  - follow long pointer chains
  - working set >> on-chip cache
  - many cache misses -> frequent stalls on RAM fetches
- Example: a garbage collector
  - identifies live objects by following inter-object pointers
  - spends much of its time stalled on RAM latency while following those pointers

3 Addressing the RAM-latency bottleneck?
View RAM the way we view disk:
- high latency
- a similar set of optimization techniques applies:
  - batching
  - sorting
  - performing I/O in parallel and asynchronously

4 Outline
- Hardware background
- Three techniques to address the RAM-latency bottleneck:
  - linearization: garbage collector
  - interleaving: Masstree
  - parallelization: Masstree
- Discussion
- Conclusion

5 Three Relevant Hardware Features
(Example platform: Intel Xeon X5690)
1. The CPU can fetch from RAM before the data is needed, via:
- the hardware prefetcher (sequential or strided access patterns)
- software prefetch instructions (a minimal example follows below)
- out-of-order execution
2. The memory controller serves accesses to different channels in parallel.
3. Each memory channel has a row buffer cache.
[Figure: a RAM controller in front of three parallel channels (Channel 0, 1, 2)]
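To make feature 1 concrete, here is a minimal software-prefetch sketch (not from the talk): the pointer array, the function name, and the prefetch distance are all hypothetical, and `_mm_prefetch` is the x86 intrinsic for issuing the hint.

```cpp
#include <xmmintrin.h>  // _mm_prefetch
#include <cstddef>

// Illustrative sketch: when the address of a future load is known ahead
// of time, _mm_prefetch asks the memory system to start fetching it
// while the CPU keeps computing on earlier elements.
long sum_indirect(long* const* ptrs, std::size_t n) {
    const std::size_t D = 8;  // prefetch distance: a tuning assumption
    long sum = 0;
    for (std::size_t i = 0; i < n; i++) {
        if (i + D < n)  // start fetching the pointee needed D iterations from now
            _mm_prefetch(reinterpret_cast<const char*>(ptrs[i + D]),
                         _MM_HINT_T0);
        sum += *ptrs[i];
    }
    return sum;
}
```

This is the same pattern the interleaving technique later applies to tree nodes.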

6 Per-array row buffer cache
- Each channel contains many DRAM arrays like the one shown below
- Each array has one extra row, the row buffer, which caches the most recently accessed row
- On each memory access the array checks the row buffer first and reloads it on a miss
- A row-buffer hit is 2x-5x faster than a miss!
- Sequential access: 3.5x higher throughput than random access!
[Figure: data rows above a 4096-byte row buffer]
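A quick way to observe the sequential-vs-random gap on your own machine is a sketch like the following; the array size and the use of a shuffled index array are illustrative assumptions, not the talk's benchmark.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Sweep the same out-of-cache array once in order and once in shuffled
// order. The sequential pass benefits from row-buffer hits and the
// hardware prefetcher; the shuffled pass mostly misses the row buffer.
int main() {
    const std::size_t n = std::size_t(1) << 27;    // 128M uint64_t = 1 GB
    std::vector<std::uint64_t> a(n, 1);
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);

    for (int pass = 0; pass < 2; pass++) {
        if (pass == 1)  // second pass: same elements, random order
            std::shuffle(idx.begin(), idx.end(), std::mt19937_64(1));
        std::uint64_t sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i : idx) sum += a[i];
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s: %.2f s (sum=%llu)\n",
                    pass ? "random" : "sequential",
                    std::chrono::duration<double>(t1 - t0).count(),
                    static_cast<unsigned long long>(sum));
    }
}
```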

7 Linearizing memory accesses for the Garbage Collector
Garbage collector's goal:
- find live objects (tracing):
  - start from the roots (stack, global variables)
  - follow the object pointers of live objects
- reclaim the space of unreachable objects
The bottleneck of tracing is RAM latency:
- pointer addresses are unpredictable and non-sequential
- each access -> cache miss -> stall for a RAM fetch

8 Observation: arrange objects in tracing order during garbage collection
- Subsequent tracing then accesses memory in sequential order
- This takes advantage of two hardware features:
  - hardware prefetchers pull upcoming objects into cache
  - higher row-buffer hit rate

9 Benchmark and result: time to trace 1.8 GB of live data
- HSQLDB 2.2.9: an RDBMS engine written in Java
- Compacting collector of the HotSpot JVM from OpenJDK 7u6
- Used copy collection to reorder objects into tracing order (sketched below)
- Result: tracing in sequential order is 1.3x faster than in random order
- Future work:
  - a better linearizing algorithm than copy collection (which uses twice the memory!)
  - measure application-level performance improvement
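The reordering comes for free with a Cheney-style copying collector, which appends objects to to-space in the order the tracer reaches them. A minimal sketch, assuming a simplified object layout and root handling; this is not HotSpot's actual collector.

```cpp
#include <cstddef>
#include <cstring>

// Cheney-style copying collection. Because the scan pointer advances
// through to-space in the order objects were copied, the surviving heap
// ends up laid out in tracing order, making later traces sequential.
struct Object {
    std::size_t size;       // total bytes, header included
    std::size_t num_ptrs;   // pointer fields stored right after the header
    Object* forward;        // set once the object has been evacuated
    Object** fields() { return reinterpret_cast<Object**>(this + 1); }
};

static Object* copy(Object* obj, char*& free_ptr) {
    if (obj->forward) return obj->forward;    // already evacuated
    Object* to = reinterpret_cast<Object*>(free_ptr);
    std::memcpy(to, obj, obj->size);          // append to to-space
    free_ptr += obj->size;
    obj->forward = to;                        // leave a forwarding address
    return to;
}

void collect(Object** roots, std::size_t num_roots, char* to_space) {
    char* scan = to_space;      // copied but not yet scanned
    char* free_ptr = to_space;  // end of copied data
    for (std::size_t i = 0; i < num_roots; i++)
        roots[i] = copy(roots[i], free_ptr);
    while (scan < free_ptr) {   // scan objects in the order they were copied
        Object* obj = reinterpret_cast<Object*>(scan);
        for (std::size_t f = 0; f < obj->num_ptrs; f++)
            if (obj->fields()[f])
                obj->fields()[f] = copy(obj->fields()[f], free_ptr);
        scan += obj->size;
    }
}
```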

10 Interleaving in Masstree
- It is not always possible to linearize memory accesses
- Masstree: a high-performance in-memory key-value store for multi-core machines
  - all cores share a single B+tree
  - each core runs a dedicated worker thread
  - scales well on multi-core
- Focus on single-threaded Masstree for now

11 Single-threaded Masstree is RAM-latency dominated
Carefully designed to avoid RAM fetches:
- a trie of B+trees; key fragments and children are inlined in tree nodes
- one fat B+tree node is fetched in a single RAM latency
Still RAM-latency dominated!
- each key lookup follows a random path through the tree
- O(log N) RAM latencies (hundreds of cycles each) per lookup
- about a million lookups per second

12 Batch and interleave tree lookups
- Batch the key lookups
- Interleave computation with RAM fetches using software prefetch

13 Perform a batch of lookups without stalling on RAM fetches!
Walkthrough of a batch of two lookups (keys A and X) starting at root node E:
1. Find the child of E containing A; prefetch(B)
2. Find the child of E containing X; prefetch(F)
3. Find the child of B containing A; prefetch(A). B is already in cache!
4. Find the child of F containing X; prefetch(X). F is already in cache!
This works as long as the computation (inspecting a batch of nodes) takes longer than one RAM latency.
Result: a 30% improvement with a batch of five. (A sketch of this loop follows below.)
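A minimal sketch of the interleaved descent loop described above, assuming a simplified B+tree node layout and batch bound; this is not Masstree's real code.

```cpp
#include <xmmintrin.h>
#include <cstdint>

// Hypothetical node layout; FANOUT, MAX_BATCH, and the linear key
// search are assumptions for illustration.
constexpr int FANOUT = 15;
constexpr int MAX_BATCH = 16;

struct Node {
    bool leaf;
    int nkeys;
    std::uint64_t keys[FANOUT];
    Node* children[FANOUT + 1];   // used by interior nodes only
    Node* find_child(std::uint64_t key) const {
        int i = 0;
        while (i < nkeys && key >= keys[i]) i++;
        return children[i];
    }
};

// Advance every lookup in the batch by one tree level per round. The
// prefetch issued for lookup i overlaps with the node searches for
// lookups i+1..batch-1, so by the next round lookup i's node is
// usually already in cache. Requires batch <= MAX_BATCH.
void batch_lookup(Node* root, const std::uint64_t* keys,
                  Node** out, int batch) {
    Node* cur[MAX_BATCH];
    for (int i = 0; i < batch; i++) cur[i] = root;
    for (bool progress = true; progress; ) {
        progress = false;
        for (int i = 0; i < batch; i++) {
            if (cur[i]->leaf) continue;        // this lookup has finished
            Node* next = cur[i]->find_child(keys[i]);
            _mm_prefetch(reinterpret_cast<const char*>(next), _MM_HINT_T0);
            cur[i] = next;
            progress = true;
        }
    }
    for (int i = 0; i < batch; i++) out[i] = cur[i];  // leaves to search
}
```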

14 Parallelizing Masstree
- Interesting observation: these applications are limited by RAM latency, not by the CPU, yet adding more cores helps!
- Reason: RAM is a parallel system, and more cores keep it busier (see the sketch below)
- Comparison with the interleaving technique:
  - same effect: keeping RAM busier
  - difference: interleaving issues overlapping loads from one core; parallelization issues them from many cores
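A sketch of the effect, with illustrative sizes rather than the talk's experiment: each thread chases a chain of serially dependent loads, so a single core keeps only one RAM miss outstanding, while T threads hand the memory controller T independent misses to overlap across channels.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

// Follow a random-permutation pointer chain: every load depends on the
// previous one, so one core is purely latency-bound.
static std::size_t chase(const std::vector<std::size_t>& next,
                         std::size_t start, std::size_t steps) {
    std::size_t i = start;
    for (std::size_t s = 0; s < steps; s++)
        i = next[i];
    return i;
}

int main() {
    const std::size_t n = std::size_t(1) << 27;   // 1 GB of indices: >> cache
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), 0);
    std::shuffle(next.begin(), next.end(), std::mt19937_64(42));

    std::atomic<std::size_t> sink{0};             // defeat dead-code elimination
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < std::thread::hardware_concurrency(); t++)
        threads.emplace_back([&, t] { sink += chase(next, t, 50'000'000); });
    for (auto& th : threads) th.join();           // more threads, more misses in flight
}
```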

15 Parallelization improves performance by issuing more RAM loads

16 Interleaving and parallelization can be complementary
- Combined, they beat stock Masstree by 12-30%
- The improvement decreases with more cores: parallelization alone can saturate RAM

17 Discussion: applicability, lessons, and challenges in automatic interleaving
- Interleaving seems more general than linearization
  - could it be applied to the garbage collector?
- Interleaving is harder to apply than parallelization:
  - it requires batching and concurrency control
- Challenges in automatic interleaving:
  - need to identify and resolve conflicting accesses
  - difficult or impossible without the programmer's help

18 Discussion: interleaving on other data structures
Data structures and potential applications:
- B+tree: Masstree
  - do other applications use in-memory B+trees?
- Hashtable: memcached
  - a single hashtable
  - multi-get API: natural batching and interleaving (see the sketch below)
  - preliminary result: an interleaved hashtable improves throughput by 1.3x
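A sketch of what an interleaved multi-get might look like on a chained hashtable; the entry layout, hash function, and batch bound are assumptions for illustration, not memcached's code.

```cpp
#include <xmmintrin.h>
#include <cstddef>
#include <cstdint>

constexpr int MAX_BATCH = 16;

struct Entry { std::uint64_t key; const char* value; Entry* next; };

struct Table {
    Entry** buckets;
    std::size_t nbuckets;

    std::size_t hash(std::uint64_t k) const {
        return (k * 0x9E3779B97F4A7C15ull) % nbuckets;  // illustrative hash
    }

    // Look up n <= MAX_BATCH keys together.
    void multi_get(const std::uint64_t* keys, const char** out, int n) const {
        Entry* heads[MAX_BATCH];
        // Pass 1: find every bucket and prefetch its first entry, so the
        // fetches for all n keys are in flight at the same time.
        for (int i = 0; i < n; i++) {
            heads[i] = buckets[hash(keys[i])];
            if (heads[i])
                _mm_prefetch(reinterpret_cast<const char*>(heads[i]),
                             _MM_HINT_T0);
        }
        // Pass 2: probe the chains; by now the first entry of each chain
        // has likely already arrived in cache.
        for (int i = 0; i < n; i++) {
            out[i] = nullptr;
            for (Entry* e = heads[i]; e; e = e->next)
                if (e->key == keys[i]) { out[i] = e->value; break; }
        }
    }
};
```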

19 Discussion: profiling tools
- Typical workflow with Linux perf: look at the most expensive function, then inspect it manually
- This can be misleading: is that function computation-limited or RAM-latency-limited?
- Would a tool based on RAM stalls help?

20 Related Work
- PALM [Jason11]: a B+tree using the same interleaving technique
- RAM parallelization at different levels: regulation considered harmful [Park13]

21 Conclusion
- Identified a class of applications dominated by RAM latency
- Presented three techniques that address the RAM-latency bottlenecks of two applications
- Could you improve your program similarly?

22 Questions?

23 Single-threaded Masstree is RAM-latency dominated
- Trie: a tree in which each level is indexed by a fixed-length key fragment
[Figure: a top-level B+tree indexed by k[0:7]; its leaves point to B+trees indexed by k[8:15]]

