Investigating Distributed Caching Mechanisms for Hadoop
Gurmeet Singh, Puneet Chandra, Rashid Tahir
GOAL
Explore the feasibility of a distributed caching mechanism inside Hadoop
Presentation Overview
Motivation
Design
Experimental Results
Future Work
Motivation
Disk access times are a bottleneck in cluster computing
Large amounts of data are read from disk
Existing approaches: DARE, RAMClouds, PACMan (coordinated cache replacement)
We want to strike a balance between RAM and disk storage
Our Approach
Integrate Memcached with Hadoop, using QuickCached (server) and spymemcached (client)
Reserve a portion of the main memory at each node to serve as a local cache
The local caches aggregate into a distributed caching layer governed by Memcached
Greedy caching strategy with a Least Recently Used (LRU) cache eviction policy (client-side sketch below)
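A minimal sketch of the client side of this approach, assuming a QuickCached/memcached instance listening on each node and accessed through spymemcached. The class name, the "block:<id>" key format, and the choice to cache raw block bytes are illustrative assumptions, not the project's actual code.

```java
import java.io.IOException;
import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

// Illustrative client-side sketch: spymemcached talking to a
// QuickCached/memcached instance that holds recently read blocks.
public class BlockCacheSketch {

    private final MemcachedClient client;

    public BlockCacheSketch(String host, int port) throws IOException {
        // One client per node; the memcached instances on all nodes
        // together form the distributed cache layer.
        client = new MemcachedClient(new InetSocketAddress(host, port));
    }

    // Greedy caching: store every block that gets read; removal is
    // left to the LRU eviction policy.
    public void putBlock(long blockId, byte[] data) {
        client.set("block:" + blockId, 0, data); // 0 = no explicit expiry
    }

    // Returns the cached bytes, or null on a cache miss.
    public byte[] getBlock(long blockId) {
        return (byte[]) client.get("block:" + blockId);
    }

    public void close() {
        client.shutdown();
    }
}
```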
Design Overview
Memcached
Design Choice 1
Send simultaneous requests to the NameNode and to Memcached
Minimizes access latency, at the cost of additional network overhead
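A minimal sketch of this choice, where both lookups are issued in parallel. cacheLookup() and namenodeLookup() are hypothetical stand-ins for the Memcached get and the normal NameNode/DataNode read path.

```java
import java.util.concurrent.CompletableFuture;

// Sketch of Design Choice 1: race the cache lookup against the
// NameNode/disk lookup and take whichever result is usable.
public abstract class ParallelLookupSketch {

    protected abstract byte[] cacheLookup(long blockId);    // Memcached get
    protected abstract byte[] namenodeLookup(long blockId); // NameNode + DataNode read

    public byte[] readBlock(long blockId) throws Exception {
        CompletableFuture<byte[]> cachePath =
                CompletableFuture.supplyAsync(() -> cacheLookup(blockId));
        CompletableFuture<byte[]> diskPath =
                CompletableFuture.supplyAsync(() -> namenodeLookup(blockId));

        byte[] cached = cachePath.get();   // the cache answer typically arrives first
        if (cached != null) {
            diskPath.cancel(true);         // latency saved, but the extra request was already sent
            return cached;
        }
        return diskPath.get();             // miss: the disk read was already in flight
    }
}
```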
Design Choice 2
Send a request to the NameNode only in the case of a cache miss
Minimizes network overhead, at the cost of increased latency on misses
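The corresponding sketch for this choice, using the same hypothetical helpers plus a cachePut() for greedy population on a miss.

```java
// Sketch of Design Choice 2: consult the cache first and contact the
// NameNode only on a miss.
public abstract class CacheFirstLookupSketch {

    protected abstract byte[] cacheLookup(long blockId);
    protected abstract byte[] namenodeLookup(long blockId);
    protected abstract void cachePut(long blockId, byte[] data);

    public byte[] readBlock(long blockId) {
        byte[] cached = cacheLookup(blockId);   // single round trip on a hit
        if (cached != null) {
            return cached;                      // NameNode is never contacted
        }
        byte[] data = namenodeLookup(blockId);  // miss: second round trip adds latency
        cachePut(blockId, data);                // greedy caching for future reads
        return data;
    }
}
```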
Design Choice 3
DataNodes send requests only to Memcached
Memcached checks for cached blocks
On a cache miss, it contacts the NameNode and returns the replica addresses to the DataNodes
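A sketch of this choice, where miss handling moves into the cache tier itself: the requester gets back either the cached block or the replica addresses. NamenodeStub and getReplicaAddresses() are hypothetical stand-ins for the NameNode RPC.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of Design Choice 3: requesters talk only to the cache tier;
// on a miss the cache tier asks the NameNode and returns replica addresses.
public class CacheSideResolverSketch {

    public interface NamenodeStub {
        List<String> getReplicaAddresses(long blockId);
    }

    // Lookup result: a cached block on a hit, replica addresses on a miss.
    public static final class Result {
        public final byte[] cachedBlock;
        public final List<String> replicaAddresses;
        Result(byte[] block, List<String> addrs) {
            cachedBlock = block;
            replicaAddresses = addrs;
        }
    }

    private final Map<Long, byte[]> cache = new HashMap<>();
    private final NamenodeStub namenode;

    public CacheSideResolverSketch(NamenodeStub namenode) {
        this.namenode = namenode;
    }

    public Result lookup(long blockId) {
        byte[] cached = cache.get(blockId);
        if (cached != null) {
            return new Result(cached, null);                            // hit
        }
        return new Result(null, namenode.getReplicaAddresses(blockId)); // miss
    }
}
```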
Global Cache Replacement
LRU-based global cache eviction scheme (per-node sketch below)
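Memcached applies LRU internally on each server; the following illustrative sketch shows the per-node policy itself using a LinkedHashMap in access order. The class and its block-count capacity are assumptions made for illustration, not the project's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative per-node LRU cache: access-ordered LinkedHashMap that
// evicts the least recently used block once capacity is exceeded.
public class LruBlockCacheSketch extends LinkedHashMap<Long, byte[]> {

    private final int maxBlocks;

    public LruBlockCacheSketch(int maxBlocks) {
        super(16, 0.75f, true);        // accessOrder = true keeps LRU ordering
        this.maxBlocks = maxBlocks;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
        return size() > maxBlocks;     // drop the least recently used block
    }
}
```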
Prefetching
Simulation Results
Test data ranging from 2 GB to 24 GB
Benchmarks: Word Count and Grep
Word Count
Grep
Future Work
Implement a pre-fetching mechanism
Customized caching policies based on access patterns
Compare and contrast caching with locality-aware scheduling
Conclusion
Caching can improve the performance of cluster-based systems, depending on the access patterns of the workload being executed