Investigating Distributed Caching Mechanisms for Hadoop
Gurmeet Singh, Puneet Chandra, Rashid Tahir
GOAL
Explore the feasibility of a distributed caching mechanism inside Hadoop
Presentation Overview
Motivation
Design
Experimental Results
Future Work
Motivation
Disk access times are a bottleneck in cluster computing
Large amounts of data are read from disk
Existing approaches: DARE, RAMClouds, PACMan (coordinated cache replacement)
We want to strike a balance between RAM and disk storage
Our Approach
Integrate Memcached with Hadoop, using QuickCached (server) and spymemcached (client)
Reserve a portion of the main memory at each node to serve as a local cache
The local caches aggregate into a distributed caching layer governed by Memcached
Greedy caching strategy with a Least Recently Used (LRU) cache eviction policy (client-side sketch below)
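A minimal sketch of the client side of this approach, assuming a QuickCached/memcached instance listening on each node and accessed through spymemcached. The class name, the "block:<id>" key format, and the choice to cache raw block bytes are illustrative assumptions, not the project's actual code.

```java
import java.io.IOException;
import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

// Illustrative client-side sketch: spymemcached talking to a
// QuickCached/memcached instance that holds recently read blocks.
public class BlockCacheSketch {

    private final MemcachedClient client;

    public BlockCacheSketch(String host, int port) throws IOException {
        // One client per node; the memcached instances on all nodes
        // together form the distributed cache layer.
        client = new MemcachedClient(new InetSocketAddress(host, port));
    }

    // Greedy caching: store every block that gets read; removal is
    // left to the LRU eviction policy.
    public void putBlock(long blockId, byte[] data) {
        client.set("block:" + blockId, 0, data); // 0 = no explicit expiry
    }

    // Returns the cached bytes, or null on a cache miss.
    public byte[] getBlock(long blockId) {
        return (byte[]) client.get("block:" + blockId);
    }

    public void close() {
        client.shutdown();
    }
}
```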
Design Overview
Memcached
Design Choice 1
Send simultaneous requests to the NameNode and to Memcached
Minimizes access latency, at the cost of additional network overhead
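A minimal sketch of this choice, where both lookups are issued in parallel. cacheLookup() and namenodeLookup() are hypothetical stand-ins for the Memcached get and the normal NameNode/DataNode read path.

```java
import java.util.concurrent.CompletableFuture;

// Sketch of Design Choice 1: race the cache lookup against the
// NameNode/disk lookup and take whichever result is usable.
public abstract class ParallelLookupSketch {

    protected abstract byte[] cacheLookup(long blockId);    // Memcached get
    protected abstract byte[] namenodeLookup(long blockId); // NameNode + DataNode read

    public byte[] readBlock(long blockId) throws Exception {
        CompletableFuture<byte[]> cachePath =
                CompletableFuture.supplyAsync(() -> cacheLookup(blockId));
        CompletableFuture<byte[]> diskPath =
                CompletableFuture.supplyAsync(() -> namenodeLookup(blockId));

        byte[] cached = cachePath.get();   // the cache answer typically arrives first
        if (cached != null) {
            diskPath.cancel(true);         // latency saved, but the extra request was already sent
            return cached;
        }
        return diskPath.get();             // miss: the disk read was already in flight
    }
}
```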
Design Choice 2
Send a request to the NameNode only in the case of a cache miss
Minimizes network overhead, at the cost of increased latency on misses
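The corresponding sketch for this choice, using the same hypothetical helpers plus a cachePut() for greedy population on a miss.

```java
// Sketch of Design Choice 2: consult the cache first and contact the
// NameNode only on a miss.
public abstract class CacheFirstLookupSketch {

    protected abstract byte[] cacheLookup(long blockId);
    protected abstract byte[] namenodeLookup(long blockId);
    protected abstract void cachePut(long blockId, byte[] data);

    public byte[] readBlock(long blockId) {
        byte[] cached = cacheLookup(blockId);   // single round trip on a hit
        if (cached != null) {
            return cached;                      // NameNode is never contacted
        }
        byte[] data = namenodeLookup(blockId);  // miss: second round trip adds latency
        cachePut(blockId, data);                // greedy caching for future reads
        return data;
    }
}
```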
Design Choice 3
DataNodes send requests only to Memcached
Memcached checks for cached blocks
On a cache miss, it contacts the NameNode and returns the replica addresses to the DataNodes
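A sketch of this choice, where miss handling moves into the cache tier itself: the requester gets back either the cached block or the replica addresses. NamenodeStub and getReplicaAddresses() are hypothetical stand-ins for the NameNode RPC.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of Design Choice 3: requesters talk only to the cache tier;
// on a miss the cache tier asks the NameNode and returns replica addresses.
public class CacheSideResolverSketch {

    public interface NamenodeStub {
        List<String> getReplicaAddresses(long blockId);
    }

    // Lookup result: a cached block on a hit, replica addresses on a miss.
    public static final class Result {
        public final byte[] cachedBlock;
        public final List<String> replicaAddresses;
        Result(byte[] block, List<String> addrs) {
            cachedBlock = block;
            replicaAddresses = addrs;
        }
    }

    private final Map<Long, byte[]> cache = new HashMap<>();
    private final NamenodeStub namenode;

    public CacheSideResolverSketch(NamenodeStub namenode) {
        this.namenode = namenode;
    }

    public Result lookup(long blockId) {
        byte[] cached = cache.get(blockId);
        if (cached != null) {
            return new Result(cached, null);                            // hit
        }
        return new Result(null, namenode.getReplicaAddresses(blockId)); // miss
    }
}
```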
Global Cache Replacement
LRU-based global cache eviction scheme (per-node sketch below)
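Memcached applies LRU internally on each server; the following illustrative sketch shows the per-node policy itself using a LinkedHashMap in access order. The class and its block-count capacity are assumptions made for illustration, not the project's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative per-node LRU cache: access-ordered LinkedHashMap that
// evicts the least recently used block once capacity is exceeded.
public class LruBlockCacheSketch extends LinkedHashMap<Long, byte[]> {

    private final int maxBlocks;

    public LruBlockCacheSketch(int maxBlocks) {
        super(16, 0.75f, true);        // accessOrder = true keeps LRU ordering
        this.maxBlocks = maxBlocks;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
        return size() > maxBlocks;     // drop the least recently used block
    }
}
```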
Prefetching
Simulation Results
Test data ranging from 2 GB to 24 GB
Benchmarks: Word Count and Grep
Word Count
Grep
Future Work
Implement a pre-fetching mechanism
Customized caching policies based on access patterns
Compare and contrast caching with locality-aware scheduling
Conclusion
Caching can improve the performance of cluster-based systems, depending on the access patterns of the workload being executed