Memcached Integration with Twister

Saliya Ekanayake, Jerome Mitchell, Yiming Sun, Judy Qiu
Indiana University, School of Computing and Informatics, CSCI-B 649

ABSTRACT

The MapReduce programming model has proven its value in large-scale data analysis since its introduction by Google. Twister extends this model with efficient support for iterative MapReduce, using streaming data transfer via publish/subscribe messaging. That efficiency has recently been challenged by applications that transfer large chunks of data. As a viable solution, we propose a unified memory access layer across Twister nodes in place of the publish/subscribe broker. The implementation integrates Memcached, an open-source distributed memory caching system, with Twister. We expect this approach to perform comparably to the current Twister implementation. We also plan to compare our implementation with another proposed solution, which uses RPC messaging with Twister.

INTRODUCTION

- Google pioneered a scalable distributed framework with its implementation of MapReduce [1], which supports the map and reduce model of computation.
- Twister [3], an iterative MapReduce framework developed by the SALSA group at Indiana University's Pervasive Technology Institute, has outperformed comparable frameworks such as Hadoop and DryadLINQ [4] because of its streaming data transfer via publish/subscribe messaging.
- Twister messaging has shown delays when transferring large chunks of data. Our solution is to integrate a unified memory access layer across Twister nodes to handle the large data transfers.
- This eliminates the need for a messaging broker and is expected to achieve performance comparable to the current implementation of Twister.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Commun. ACM, vol. 51.
[2] Apache Hadoop.
[3] Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, Geoffrey Fox, "Twister: A Runtime for Iterative MapReduce," The First International Workshop on MapReduce and its Applications (MAPREDUCE'10), HPDC 2010.
[4] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and C. J., "DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language," in Symposium on Operating System Design and Implementation (OSDI).
[5] M. Deris et al., 2003, "High Service Reliability for Cluster Server Systems," Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER'03).
[6] B. Fitzpatrick, 2004, "Distributed Caching with Memcached," Linux Journal.
[7] H. H. Shen, S. M. Chen, W. M. Zheng, and S. M. Shi, 2001, "A Communication Model for Data Availability on Server Clusters," Proc. Int'l Symposium on Distributed Computing and Application.
[8] Z. Cvetanovic, 2004, "Performance Analysis Tools for Large-scale Linux Clusters," IEEE, 2004.

RELATED WORK

There are currently two proposed solutions to the bottleneck of data transfer via publish/subscribe:

- Peer-to-peer data transfer via TCP in Twister. The messaging broker would then handle only control messages, leaving the burden of data transfer to the compute nodes themselves.
- RPC messaging between nodes to handle communication. This is under development in parallel with our solution by another group of students at Indiana University.

MEMCACHED

Memcached provides a unified layer of memory across a set of machines (Figure 1). The memory layer simply stores and retrieves key-value pairs, which makes it well suited to Twister because Twister's communication often consists of key-value pairs. Once Memcached is integrated (Figure 3), Twister will be able to access objects seamlessly across its compute nodes, which we expect to improve performance over its current architecture (Figure 2).
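The idea of a unified key-value memory layer shared by map and reduce tasks can be sketched as follows. This is a minimal illustration under stated assumptions, not the poster's implementation: `NodeCache` is a hypothetical dict-backed stand-in for one Memcached daemon, `UnifiedCache` is a hypothetical wrapper that picks a node by hashing the key, and the `map-output:<id>` key scheme is invented for the example. A real deployment would talk to Memcached daemons through a client library instead.

```python
import hashlib
import pickle


class NodeCache:
    """Dict-backed stand-in for one Memcached daemon (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        # Returns None on a cache miss.
        return self._data.get(key)


class UnifiedCache:
    """Presents several per-node caches as a single key-value store."""

    def __init__(self, nodes):
        self._nodes = list(nodes)

    def _node_for(self, key):
        # Hash the key so every caller maps it to the same node.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self._nodes[int(digest, 16) % len(self._nodes)]

    def set(self, key, obj):
        # Objects are serialized to bytes before being stored.
        self._node_for(key).set(key, pickle.dumps(obj))

    def get(self, key):
        raw = self._node_for(key).get(key)
        return None if raw is None else pickle.loads(raw)


def map_task(cache, task_id, words):
    """Count words and publish the partial result under a known key."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    key = f"map-output:{task_id}"  # hypothetical key scheme
    cache.set(key, counts)
    return key


def reduce_task(cache, keys):
    """Fetch every partial result from the cache and merge the counts."""
    total = {}
    for key in keys:
        for word, n in cache.get(key).items():
            total[word] = total.get(word, 0) + n
    return total


cache = UnifiedCache([NodeCache("node-0"), NodeCache("node-1"), NodeCache("node-2")])
keys = [map_task(cache, 0, ["a", "b", "a"]), map_task(cache, 1, ["b", "c"])]
word_counts = reduce_task(cache, keys)
print(word_counts)
```

In this sketch only the small keys travel between tasks; the payloads are read directly from whichever node holds them, which is the role the poster assigns to the memory layer in place of the broker.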
VALIDATION AND EVALUATION METHODS

To determine how well a messaging infrastructure scales under various loads, we evaluate it against the following criteria:

1. The effectiveness of the infrastructure
2. The time taken for messages to be delivered
3. The memory and CPU utilization of the architecture

We also intend to use these criteria to compare our approach with Twister's current implementation and with the alternative solution of using RPC messaging with Twister. Two key applications will be used for the performance comparison:

- The PageRank algorithm calculates the new access probability for each web page based on the values calculated in the previous iteration. Iteration stops when the difference δ falls below a predefined threshold, where δ is the vector distance between the page access probabilities in the Nth iteration and those in the (N+1)th iteration.
- Biological sequence alignment using Smith-Waterman is a classic all-pairs problem that performs local alignment for each pair of input sequences. It requires data transfer across nodes during the computation, which gives good grounds for performance comparison.

Figure 1: Memcached View of Spare Memory (labels: Compute Node, Spare Memory, Unified View of Spare Memory and Object Caching via Memcached)

Figure 2: Twister Architecture (labels: Driver, Head Node, Broker, Control Topic, Subscribed Topics, Compute Nodes, Daemon)

Figure 3: Proposed Memcached Integration (labels: Driver, Head Node, Compute Nodes, Daemon, Unified Memory Access with Memcached)
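The PageRank stopping rule described under Validation and Evaluation Methods can be sketched as a plain single-process loop. The sketch rests on assumptions that are not in the poster: a damping factor of 0.85, Euclidean distance as the "vector distance" δ, a small hand-made link graph, and a graph in which every page has at least one outgoing link. In Twister, each pass of the loop would correspond to one MapReduce iteration.

```python
import math


def pagerank(links, damping=0.85, threshold=1e-6, max_iterations=100):
    """Iterate until the distance between successive rank vectors is small.

    links maps each page to the list of pages it links to. Every page must
    appear as a key and have at least one outgoing link (assumption).
    """
    pages = sorted(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    delta = float("inf")
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        # Every page starts the round with the random-jump share.
        new_ranks = {page: (1.0 - damping) / n for page in pages}
        # Each page passes its damped rank evenly to the pages it links to.
        for page, outgoing in links.items():
            share = damping * ranks[page] / len(outgoing)
            for target in outgoing:
                new_ranks[target] += share
        # delta: Euclidean distance between iteration N and iteration N+1.
        delta = math.sqrt(sum((new_ranks[p] - ranks[p]) ** 2 for p in pages))
        ranks = new_ranks
        if delta < threshold:
            break
    return ranks, iteration, delta


graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical link graph
ranks, iterations, delta = pagerank(graph)
print(iterations, delta, ranks)
```

Because the rank vector from iteration N is needed in full at the start of iteration N+1, this loop is the kind of workload in which intermediate data moves between nodes on every pass.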