MapReduce for Data Intensive Scientific Analyses


MapReduce for Data Intensive Scientific Analyses
Jaliya Ekanayake, Shrideep Pallickara, Geoffrey Fox
Department of Computer Science, Indiana University, Bloomington, IN 47405

Presentation Outline
- Introduction
- MapReduce and the Current Implementations
- Current Limitations
- Our Solution
- Evaluation and the Results
- Future Work and Conclusion

Data/Compute Intensive Applications
Computation- and data-intensive applications are increasingly prevalent, and data volumes are already at the petabyte scale:
- High Energy Physics (HEP): the Large Hadron Collider (LHC) produces tens of petabytes of data annually
- Astronomy: the Large Synoptic Survey Telescope has a nightly rate of 20 terabytes
- Information Retrieval: Google, MSN, Yahoo, Wal-Mart, etc.
Many applications and domains are compute intensive as well: HEP, astronomy, chemistry, biology, seismology, etc.
- Clustering: Kmeans, Deterministic Annealing, pair-wise clustering, etc.
- Multi-Dimensional Scaling (MDS) for visualizing high-dimensional data

Composable Applications
How do we support these large-scale applications? With efficient parallel/concurrent algorithms and implementation techniques.
Some key observations: most of these applications are
- a Single Program Multiple Data (SPMD) program, or a collection of SPMDs
- composable: processing can be split into small sub-computations, and the partial results of these computations are merged after some post-processing
- loosely synchronized: they can withstand the communication latencies typically experienced over wide area networks
This makes them distinct from both closely coupled parallel applications and totally decoupled applications. With large volumes of data and higher computation requirements, can even closely coupled parallel applications withstand higher communication latencies?

The Composable Class of Applications
[Figure: a spectrum of applications by synchronization granularity — Cannon's Algorithm for matrix multiplication as a tightly coupled application (tightly synchronized SPMDs, microseconds); a composable application turning a set of TIF input files into PDF files (loosely synchronized SPMDs, milliseconds); and a totally decoupled application.]
The composable class can be implemented in high-level programming models such as MapReduce and Dryad.

MapReduce
“MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.”
— MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
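In the paper's notation, the two user-supplied functions have these types:

    map    (k1, v1)        -> list(k2, v2)
    reduce (k2, list(v2))  -> list(v2)

The intermediate keys k2 are what the framework groups and sorts on between the two phases.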

MapReduce
[Figure: data flow from input D1, D2, …, Dm through map and reduce to outputs O1, O2.] The execution proceeds in order:
1. Data is split into m parts (D1, D2, …, Dm).
2. The map function is performed on each of these data parts concurrently.
3. A hash function maps the results of the map tasks to r reduce tasks.
4. Once all the results for a particular reduce task are available, the framework executes the reduce task.
5. A combine task may be necessary to combine all the outputs of the reduce functions together.
The framework supports:
- splitting of data
- passing the output of map functions to reduce functions
- sorting the inputs to the reduce function based on the intermediate keys
- quality of service
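Step 3's partitioning is typically nothing more than a modulus over the intermediate key's hash; a minimal sketch (not any specific framework's code):

    // Route an intermediate key to one of r reduce tasks.
    int reduceTask = (key.hashCode() & Integer.MAX_VALUE) % r;

This guarantees that all values sharing a key land on the same reduce task, which is what makes the per-key merge in reduce possible.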

Hadoop Example: Word Count
The user supplies only the two functions; in the pseudocode of Dean and Ghemawat:

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));

In Hadoop, the Task Trackers (TT) on the Data/Compute Nodes execute the map tasks and write their output to local files; the reduce side retrieves the map results via HTTP, sorts the outputs, and executes the reduce tasks.
[Figure: map tasks M1–M4 and reduce tasks R1–R4 spread across Data/Compute Nodes, each node running a DataNode (DN) and a TaskTracker (TT).]
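For reference, the same job written against the modern Hadoop MapReduce Java API (a later API style than the one this 2008 talk used, shown here as a runnable sketch) looks roughly like this:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          // value: one line of the document's contents
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit <word, 1>
          }
        }
      }

      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          // key: a word; values: the counts emitted for it
          int sum = 0;
          for (IntWritable val : values) sum += val.get();
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation on map nodes
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }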

Current Limitations
The MapReduce programming model could be applied to most composable applications, but:
- the current MapReduce model and runtimes focus on “single step” MapReduce computations only
- intermediate data is stored and accessed via file systems
- this is inefficient for the iterative computations to which the MapReduce technique could otherwise be applied
- there is no performance model to compare against other high-level or low-level parallel runtimes
A sketch of the single-step restriction in practice follows below.
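To see why the single-step restriction hurts, consider how an iterative algorithm has to be driven in such a runtime: one complete job per iteration, with all intermediate state written to and re-read from the distributed file system. A hedged sketch of that driver shape (IterMapper, IterReducer, the paths, and the iteration bound are all hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IterativeDriver {
      static final int MAX_ITERATIONS = 20;  // hypothetical bound

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        for (int i = 0; i < MAX_ITERATIONS; i++) {
          // A brand-new job per iteration: workers are re-initialized each time.
          Job job = Job.getInstance(conf, "iteration-" + i);
          job.setJarByClass(IterativeDriver.class);
          job.setMapperClass(IterMapper.class);      // hypothetical map class
          job.setReducerClass(IterReducer.class);    // hypothetical reduce class
          FileInputFormat.addInputPath(job, input);  // static data re-read every pass
          // Intermediate state is written to the distributed file system...
          FileOutputFormat.setOutputPath(job, new Path("/tmp/iter-" + i));
          job.waitForCompletion(true);
          // ...and must be read back from it by the next iteration's tasks.
        }
      }
    }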

CGL-MapReduce
[Figure: architecture of CGL-MapReduce — the User Program and the MR Driver (D) communicate over a Content Dissemination Network with worker nodes running map workers (M), reduce workers (R), and an MRDaemon (D); data is read from and written to a file system, starting from a data split.]
- A streaming-based MapReduce runtime implemented in Java
- All communications (control messages and intermediate results) are routed via a content dissemination network
- Intermediate results are transferred directly from the map tasks to the reduce tasks, eliminating local files
- MRDriver maintains the state of the system and controls the execution of map/reduce tasks
- The User Program is the composer of MapReduce computations
- Supports both single-step and iterative MapReduce computations

CGL-MapReduce – The Flow of Execution
1. Initialization: start the map/reduce workers; configure both map/reduce tasks (with configurations and fixed data).
2. Map: execute map tasks, passing <key, value> pairs.
3. Reduce: execute reduce tasks, passing <key, List<values>>.
4. Combine: combine the outputs of all the reduce tasks.
5. Termination: terminate the map/reduce workers.
For iterative MapReduce, the variable data is re-initialized and steps 2–4 repeat until the computation terminates; the fixed data and the workers configured in step 1 are reused across iterations. A driver sketch follows below.
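The transcript does not include CGL-MapReduce's actual Java API, so the following skeleton only mirrors the five steps above with hypothetical names (MRDriver here is illustrative, not the real class):

    // Hypothetical skeleton of the five-step flow; every name is illustrative.
    MRDriver driver = new MRDriver(jobConfig);
    driver.configure(fixedData);           // step 1: start workers, load fixed data once
    VariableData state = initialState();   // variable data for the first iteration
    do {
      driver.map(state);                   // step 2: map tasks receive <key, value> pairs
      driver.reduce();                     // step 3: reduce tasks receive <key, List<values>>
      state = driver.combine();            // step 4: merge all the reduce outputs
    } while (!state.converged());          // iterate without restarting the workers
    driver.terminate();                    // step 5: shut the workers down

The point of the shape is that only the variable data crosses the network each pass; the fixed data and the worker processes survive across iterations.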

HEP Data Analysis
Data: up to 1 terabyte of data, placed in the IU Data Capacitor.
Processing: 12 dedicated computing nodes from Quarry (a total of 96 processing cores).
[Figures: MapReduce for HEP data analysis; HEP data analysis, execution time vs. the volume of data (fixed compute resources).]
- Hadoop and CGL-MapReduce both show similar performance
- The amount of data accessed in each analysis is extremely large, so performance is limited by the I/O bandwidth
- The overhead induced by the MapReduce implementations has a negligible effect on the overall computation

HEP Data Analysis Scalability and Speedup
[Figures: execution time vs. the number of compute nodes (fixed data); speedup for 100 GB of HEP data.]
- 100 GB of data; one core of each node is used (performance is limited by the I/O bandwidth)
- Speedup = Sequential Time / MapReduce Time
- The speed gain diminishes after a certain number of parallel processing units (around 10 units)

Kmeans Clustering
[Figures: MapReduce for Kmeans clustering; Kmeans clustering, execution time vs. the number of 2D data points (both axes in log scale).]
- All three implementations perform the same Kmeans clustering algorithm
- Each test is performed using 5 compute nodes (a total of 40 processor cores)
- CGL-MapReduce shows performance close to the MPI implementation
- Hadoop's high execution time is due to its lack of support for iterative MapReduce computations and the overhead of file-system-based communication
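The talk does not show the Kmeans code itself; the standard MapReduce formulation being benchmarked can be sketched as follows (illustrative names only, one iteration): each map task assigns its share of points to the nearest current centroid, each reduce task averages the points assigned to one centroid, and the combine step gathers the new centroids and tests convergence before the next iteration.

    // One Kmeans iteration as map/reduce (sketch; Emitter is a hypothetical interface).
    // map: emit <nearestCentroidId, point> for each point in this task's split
    void map(double[] point, java.util.List<double[]> centroids,
             Emitter<Integer, double[]> out) {
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int c = 0; c < centroids.size(); c++) {
        double d = 0;
        for (int i = 0; i < point.length; i++) {
          double diff = point[i] - centroids.get(c)[i];
          d += diff * diff;                          // squared Euclidean distance
        }
        if (d < bestDist) { bestDist = d; best = c; }
      }
      out.emit(best, point);
    }

    // reduce: the new centroid is the mean of the points assigned to it
    double[] reduce(int centroidId, Iterable<double[]> points) {
      double[] sum = null;
      int n = 0;
      for (double[] p : points) {
        if (sum == null) sum = new double[p.length];
        for (int i = 0; i < p.length; i++) sum[i] += p[i];
        n++;
      }
      for (int i = 0; i < sum.length; i++) sum[i] /= n;
      return sum;  // combine gathers these and checks centroid movement for convergence
    }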

Overheads of Different Runtimes
Overhead: f(P) = [P · T(P) − T(1)] / T(1), where
- P is the number of hardware processing units
- T(P) is the execution time as a function of P
- T(1) is the time when a sequential program is used (P = 1)
Observations:
- The overhead diminishes with the amount of computation
- Loosely synchronous MapReduce (CGL-MapReduce) also shows overheads close to MPI for sufficiently large problems
- Hadoop's higher overheads may limit its use for these types of (iterative MapReduce) computations
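As a worked example with hypothetical numbers: if a sequential run takes T(1) = 1000 s and 32 cores finish in T(32) = 40 s, then f(32) = (32 × 40 − 1000) / 1000 = 0.28, i.e. 28% of the sequential work is spent on parallel overhead. A perfectly scaling run gives f(P) = 0, and f(P) shrinks as the per-unit computation grows because the fixed communication cost is amortized.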

More Applications
[Figures: MapReduce for matrix multiplication; matrix multiplication; histogramming words.]
- Matrix multiplication: an iterative algorithm
- Histogramming words: a simple (single-step) MapReduce application
- The streaming approach provides better performance in both applications

Multicore and the Runtimes
The papers [1] and [2] evaluate the performance of MapReduce on multicore computers.
- Our results show converging results for the different runtimes; the right-hand graph could be a snapshot of this convergence path
- Ease of programming could be a consideration
- Still, threads are faster on shared-memory systems
[1] Evaluating MapReduce for Multi-core and Multiprocessor Systems, C. Ranger et al.
[2] Map-Reduce for Machine Learning on Multicore, C. Chu et al.

Conclusions
[Figure: a spectrum of runtimes, from parallel algorithms with fine-grained sub-computations and tight synchronization constraints to parallel algorithms with coarse-grained sub-computations and loose synchronization constraints, with MapReduce/cloud runtimes suited to the latter.]
- Given sufficiently large problems, all runtimes converge in performance
- Streaming-based MapReduce implementations provide the faster performance necessary for most composable applications
- Support for iterative MapReduce computations expands the usability of MapReduce runtimes

Future Work
- Research different fault tolerance strategies for CGL-MapReduce and come up with a set of architectural recommendations
- Integration of a distributed file system such as HDFS
- Applicability to cloud computing environments

Questions? Thank You!

Links (Backup Slides)
- Hadoop vs. CGL-MapReduce
- Is it fair to compare Hadoop with CGL-MapReduce?
- DRYAD
- Fault Tolerance
- Rootlet Architecture
- Nimbus vs. Eucalyptus

Hadoop vs. CGL-MapReduce

Feature                                     | Hadoop                                        | CGL-MapReduce
Implementation Language                     | Java                                          | Java
Other Language Support                      | Uses Hadoop Streaming (text data only)        | Requires Java wrapper classes
Distributed File System                     | HDFS                                          | Currently assumes a shared file system between nodes
Accessing binary data from other languages  | Currently only a Java interface is available  | The shared file system enables this functionality
Fault Tolerance                             | Supports failures of nodes                    | Currently does not support fault tolerance
Iterative Computations                      | Not supported                                 | Supports iterative MapReduce
Daemon Initialization                       | Requires ssh public key access                | Requires ssh public key access

Is it fair to compare Hadoop with CGL-MapReduce?
- Hadoop accesses data via a distributed file system and stores all the intermediate results in this file system to ensure fault tolerance. Is this the optimal strategy?
- Can we use Hadoop for only “single pass” MapReduce computations?
- Would writing the intermediate results at the reduce task be a better strategy? There is a considerable reduction in data from map to reduce, and the possibility of using duplicate reduce tasks.

Fault Tolerance
Hadoop/Google fault tolerance:
- Data (input, output, and intermediate) are stored in HDFS, which uses replication
- The task tracker is a single point of failure (checkpointing)
- Failed map/reduce tasks are re-executed
CGL-MapReduce:
- Integration of HDFS or a similar parallel file system for input/output data
- MRDriver is a single point of failure (checkpointing)
- Re-executing failed map tasks (there is a considerable reduction in data from map to reduce)
- Redundant reduce tasks
Amazon model: S3, EC2, and SQS; reliable queues.

DRYAD
- The computation is structured as a directed graph
- A Dryad job is a graph generator which can synthesize any directed acyclic graph
- These graphs can even change during execution, in response to important events in the computation
- Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting
- How can it support iterative computations?