Only Aggressive Elephants are Fast Elephants Nov 11 th 2013 Database Lab. Wonseok Choi.

Slides:



Advertisements
Similar presentations
Lecture 12: MapReduce: Simplified Data Processing on Large Clusters Xiaowei Yang (Duke University)
Advertisements

Shark:SQL and Rich Analytics at Scale
Introduction to cloud computing Jiaheng Lu Department of Computer Science Renmin University of China
 Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware  Created by Doug Cutting and.
Threads Relation to processes Threads exist as subsets of processes Threads share memory and state information within a process Switching between threads.
HDFS & MapReduce Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer.
Piccolo: Building fast distributed programs with partitioned tables Russell Power Jinyang Li New York University.
Digital Library Service – An overview Introduction System Architecture Components and their functionalities Experimental Results.
Overview of MapReduce and Hadoop
LIBRA: Lightweight Data Skew Mitigation in MapReduce
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
SkewTune: Mitigating Skew in MapReduce Applications
Spark: Cluster Computing with Working Sets
Hybrid MapReduce Workflow Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox Indiana University, US.
Clydesdale: Structured Data Processing on MapReduce Jackie.
A Comparison of Join Algorithms for Log Processing in MapReduce Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, Yuanyuan Tian.
Hadoop: The Definitive Guide Chap. 2 MapReduce
Distributed Computations
Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.
CPS216: Advanced Database Systems (Data-intensive Computing Systems) How MapReduce Works (in Hadoop) Shivnath Babu.
Workshop on Basics & Hands on Kapil Bhosale M.Tech (CSE) Walchand College of Engineering, Sangli. (Worked on Hadoop in Tibco) 1.
PARALLEL DBMS VS MAP REDUCE “MapReduce and parallel DBMSs: friends or foes?” Stonebraker, Daniel Abadi, David J Dewitt et al.
The Hadoop Distributed File System, by Dhyuba Borthakur and Related Work Presented by Mohit Goenka.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland MLG, January, 2014 Jaehwan Lee.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
MapReduce VS Parallel DBMSs
Map/Reduce and Hadoop performance Ioana Manolescu Senior researcher, OAK team lead Inria Saclay and Université Paris-Sud Big Data Paris, 2013.
CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
MapReduce and Hadoop 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 2: MapReduce and Hadoop Mining Massive.
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
Panagiotis Antonopoulos Microsoft Corp Ioannis Konstantinou National Technical University of Athens Dimitrios Tsoumakos.
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
Introduction to Hadoop and HDFS
Hadoop & Condor Dhruba Borthakur Project Lead, Hadoop Distributed File System Presented at the The Israeli Association of Grid Technologies.
MAP REDUCE : SIMPLIFIED DATA PROCESSING ON LARGE CLUSTERS Presented by: Simarpreet Gill.
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, Bhavani Thuraisingham University.
MapReduce How to painlessly process terabytes of data.
Introduction to Hadoop Owen O’Malley Yahoo!, Grid Team
CPS216: Advanced Database Systems (Data-intensive Computing Systems) Introduction to MapReduce and Hadoop Shivnath Babu.
Eneryg Efficiency for MapReduce Workloads: An Indepth Study Boliang Feng Renmin University of China Dec 19.
Achieving Scalability, Performance and Availability on Linux with Oracle 9iR2-RAC Grant McAlister Senior Database Engineer Amazon.com Paper
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
Bi-Hadoop: Extending Hadoop To Improve Support For Binary-Input Applications Xiao Yu and Bo Hong School of Electrical and Computer Engineering Georgia.
Fine-grained Partitioning for Aggressive Data Skipping Liwen Sun, Michael J. Franklin, Sanjay Krishnan, Reynold S. Xin† UC Berkeley and †Databricks Inc.
MC 2 : Map Concurrency Characterization for MapReduce on the Cloud Mohammad Hammoud and Majd Sakr 1.
Presented by: Katie Woods and Jordan Howell. * Hadoop is a distributed computing platform written in Java. It incorporates features similar to those of.
Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.
MapReduce: Simplified Data Processing on Large Clusters Lim JunSeok.
 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.
Massive Semantic Web data compression with MapReduce Jacopo Urbani, Jason Maassen, Henri Bal Vrije Universiteit, Amsterdam HPDC ( High Performance Distributed.
A Comparison of Join Algorithms for Log Processing in MapReduce SIGMOD 2010 Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita,
 Introduction  Architecture NameNode, DataNodes, HDFS Client, CheckpointNode, BackupNode, Snapshots  File I/O Operations and Replica Management File.
Page 1 A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar 2011.
Ohio State University Department of Computer Science and Engineering Servicing Range Queries on Multidimensional Datasets with Partial Replicas Li Weng,
Beyond Hadoop The leading open source system for processing big data continues to evolve, but new approaches with added features are on the rise. Ibrahim.
INTRODUCTION TO HADOOP. OUTLINE  What is Hadoop  The core of Hadoop  Structure of Hadoop Distributed File System  Structure of MapReduce Framework.
Matrix Multiplication in Hadoop
BIG DATA/ Hadoop Interview Questions.
Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA ( ) VAIBHAV JAISWAL( )
By Chris immanuel, Heym Kumar, Sai janani, Susmitha
Data Platform and Analytics Foundational Training
Introduction to HDFS: Hadoop Distributed File System
Hadoop Clusters Tess Fulkerson.
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
MapReduce Simplied Data Processing on Large Clusters
The Basics of Apache Hadoop
CS 345A Data Mining MapReduce This presentation has been altered.
MapReduce: Simplified Data Processing on Large Clusters
Presentation transcript:

Only Aggressive Elephants are Fast Elephants Nov 11 th 2013 Database Lab. Wonseok Choi

2

3 목차 1. Introduction 2. Overview 3. The HAIL upload pipeline 4. The HAIL query pipeline 5. Experiments 6. Conclusion

4 Introduction Yellow elephants are slow. Because they consume their inputs entirely before responding to an elephant rider’s orders. An enhancement of HDFS and Hadoop MapReduce that dramatically improves runtimes of several classes of MapReduce jobs. HAIL changes the upload pipeline of HDFS in order to create different clustered indexes on each data block replica.

5 Introduction - idea HAIL keeps the existing physical replicas of an HDFS block in different sort orders and with different clustered indexes. We modify the upload pipeline of HDFS to already create those indexes while uploading data to HDFS. HAIL typically improves both the upload times (even if index creation is included) and the MapReduce job execution times.

6 Introduction - contribution HAIL even allows us to create more than three indexes at reasonable costs. And Hadoop is only used for appends: there are no updates. how to effectively change the Hadoop MapReduce pipeline to exploit HAIL indexes. present an extensive experimental comparison of HAIL with Hadoop and Hadoop++ HAIL reduces this overhead significantly using a novel splitting policy to partition data at query time.

7 Overview Hadoop and HDFS HAIL HAIL Benefits

8 The HAIL upload pipeline

9

10 The HAIL query pipeline SELECT sourceIP FROM UserVisits WHERE visitDate BETWEEN ‘ ’ AND ‘ ’; between( , )", void map(Text key, Text v) {... }

11 The HAIL query pipeline MAP FUNCTION FOR HADOOP MAPREDUCE (PSEUDO-CODE): void map(Text key, Text v) { String[] attributes = v.toString().split(","); if (DateUtils.isBetween(attributes[2], " ", " ")) output(attributes[0], null); } MAP FUNCTION FOR HAIL: void map(Text key, HailRecord v) { output(v.getInt(1), null); }

12 Experiments physical 10-node cluster Each node has one 2.66GHz Quad Core Xeon processor running 64-bit platform Linux openSuse 11.1 OS 4x4GB of main memory 6x750GB SATA hard disks three Gigabit network cards to understand how well HAIL scales-out, we also consider two more EC2 clusters: one with 50 nodes and one with 100 nodes

13 Experiments

14 Experiments

15 Experiments

16 Experiments

17 Experiments

18 Conclusion We have presented HAIL (Hadoop Aggressive Indexing Library). HAIL improves the upload pipeline of HDFS to create different clustered indexes on each replica. A major advantage of HAIL is that the long upload and indexing times which had to be invested on previous systems are not required anymore. We showed that HAIL typically creates a win-win situation: users can upload their datasets up to 1.6x faster than Hadoop and run jobs up to 68x faster than Hadoop.

Q & A 19