Next Generation of Apache Hadoop MapReduce Arun C. Murthy - Hortonworks Founder and Architect Formerly Architect, MapReduce.

Slides:



Advertisements
Similar presentations
MAP REDUCE PROGRAMMING Dr G Sudha Sadasivam. Map - reduce sort/merge based distributed processing Best for batch- oriented processing Sort/merge is primitive.
Advertisements

Project presentation by Mário Almeida Implementation of Distributed Systems KTH 1.
Mapreduce and Hadoop Introduce Mapreduce and Hadoop
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
A Hadoop Overview. Outline Progress Report MapReduce Programming Hadoop Cluster Overview HBase Overview Q & A.
HadoopDB Inneke Ponet.  Introduction  Technologies for data analysis  HadoopDB  Desired properties  Layers of HadoopDB  HadoopDB Components.
© 2011 VMware Inc. All rights reserved High Availability Module 7.
Hadoop 2.0 and YARN SUBASH D’SOUZA. Who am I?  Senior Specialist Engineer at Shopzilla  Co-Organizer for the Los Angeles Hadoop User group  Organizer.
© Hortonworks Inc Running Non-MapReduce Applications on Apache Hadoop Hitesh Shah & Siddharth Seth Hortonworks Inc. Page 1.
Wei-Chiu Chuang 10/17/2013 Permission to copy/distribute/adapt the work except the figures which are copyrighted by ACM.
Hadoop YARN in the Cloud Junping Du Staff Engineer, VMware China Hadoop Summit, 2013.
Resource Management with YARN: YARN Past, Present and Future
Piccolo – Paper Discussion Big Data Reading Group 9/20/2010.
Hortonworks Eric Baldeschwieler – CEO © Hortonworks Inc Architecting the Future of Big Data June 29, 2011.
Why static is bad! Hadoop Pregel MPI Shared cluster Today: static partitioningWant dynamic sharing.
Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.
Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.
Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.
Google Bigtable A Distributed Storage System for Structured Data Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University.
GridGain In-Memory Data Fabric:
CPS216: Advanced Database Systems (Data-intensive Computing Systems) How MapReduce Works (in Hadoop) Shivnath Babu.
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
Google Distributed System and Hadoop Lakshmi Thyagarajan.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Making Apache Hadoop Secure Devaraj Das Yahoo’s Hadoop Team.
A Platform for Fine-Grained Resource Sharing in the Data Center
© 2013 Mellanox Technologies 1 NoSQL DB Benchmarking with high performance Networking solutions WBDB, Xian, July 2013.
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
Applying Twister to Scientific Applications CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.
Pepper: An Elastic Web Server Farm for Cloud based on Hadoop Author : S. Krishnan, J.-S. Counio Date : Speaker : Sian-Lin Hong IEEE International.
MapReduce April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
Presented by CH.Anusha.  Apache Hadoop framework  HDFS and MapReduce  Hadoop distributed file system  JobTracker and TaskTracker  Apache Hadoop NextGen.
State of the Elephant Hadoop yesterday, today, and tomorrow Page 1 Owen
Our Experience Running YARN at Scale Bobby Evans.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
W HAT IS H ADOOP ? Hadoop is an open-source software framework for storing and processing big data in a distributed fashion on large clusters of commodity.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
f ACT s  Data intensive applications with Petabytes of data  Web pages billion web pages x 20KB = 400+ terabytes  One computer can read
Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.
MARISSA: MApReduce Implementation for Streaming Science Applications 作者 : Fadika, Z. ; Hartog, J. ; Govindaraju, M. ; Ramakrishnan, L. ; Gunter, D. ; Canon,
Issues Autonomic operation (fault tolerance) Minimize interference to applications Hardware support for new operating systems Resource management (global.
 Apache Airavata Architecture Overview Shameera Rathnayaka Graduate Assistant Science Gateways Group Indiana University 07/27/2015.
Hadoop implementation of MapReduce computational model Ján Vaňo.
DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.
© Hortonworks Inc Hadoop: Beyond MapReduce Steve Loughran, Big Data workshop, June 2013.
HADOOP Carson Gallimore, Chris Zingraf, Jonathan Light.
A Platform for Fine-Grained Resource Sharing in the Data Center
Hadoop & Neptune Feb 김형준.
{ Tanya Chaturvedi MBA(ISM) Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Next Generation of Apache Hadoop MapReduce Owen
PARALLEL AND DISTRIBUTED PROGRAMMING MODELS U. Jhashuva 1 Asst. Prof Dept. of CSE om.
Part III BigData Analysis Tools (YARN) Yuan Xue
INTRODUCTION TO HADOOP. OUTLINE  What is Hadoop  The core of Hadoop  Structure of Hadoop Distributed File System  Structure of MapReduce Framework.
1 Student Date Time Wei Li Nov 30, 2015 Monday 9:00-9:25am Shubbhi Taneja Nov 30, 2015 Monday9:25-9:50am Rodrigo Sanandan Dec 2, 2015 Wednesday9:00-9:25am.
BIG DATA/ Hadoop Interview Questions.
Apache Hadoop on Windows Azure Avkash Chauhan
Hadoop Aakash Kag What Why How 1.
Introduction to Distributed Platforms
An Open Source Project Commonly Used for Processing Big Data Sets
Chapter 10 Data Analytics for IoT
Apache Hadoop YARN: Yet Another Resource Manager
Hadoop Clusters Tess Fulkerson.
Software Engineering Introduction to Apache Hadoop Map Reduce
Distributed Systems CS
The Basics of Apache Hadoop
湖南大学-信息科学与工程学院-计算机与科学系
Hadoop Technopoints.
Overview of big data tools
Distributed Systems CS
MapReduce: Simplified Data Processing on Large Clusters
Presentation transcript:

Next Generation of Apache Hadoop MapReduce Arun C. Murthy - Hortonworks Founder and Architect Formerly Architect, Yahoo! 8 Yahoo! © Hortonworks Inc. 2011June 29, 2011

Hello! I’m Arun… Architect & Lead, Apache Hadoop MapReduce Development Team at Hortonworks (formerly at Yahoo!) Apache Hadoop Committer and Member of PMC −Full-time contributor to Apache Hadoop since early 2006

Hadoop MapReduce Today JobTracker −Manages cluster resources and job scheduling TaskTracker −Per-node agent −Manage tasks

Current Limitations Scalability −Maximum Cluster size – 4,000 nodes −Maximum concurrent tasks – 40,000 −Coarse synchronization in JobTracker Single point of failure −Failure kills all queued and running jobs −Jobs need to be re-submitted by users Restart is very tricky due to complex state Hard partition of resources into map and reduce slots © Hortonworks Inc

Current Limitations Lacks support for alternate paradigms −Iterative applications implemented using MapReduce are 10x slower. −Example: K-Means, PageRank Lack of wire-compatible protocols −Client and cluster must be of same version −Applications and workflows cannot migrate to different clusters © Hortonworks Inc

Requirements Reliability Availability Scalability - Clusters of 6,000-10,000 machines −Each machine with 16 cores, 48G/96G RAM, 24TB/36TB disks −100,000+ concurrent tasks −10,000 concurrent jobs Wire Compatibility Agility & Evolution – Ability for customers to control upgrades to the grid software stack. © Hortonworks Inc

Design Centre Split up the two major functions of JobTracker −Cluster resource management −Application life-cycle management MapReduce becomes user-land library © Hortonworks Inc

Architecture

Resource Manager −Global resource scheduler −Hierarchical queues Node Manager −Per-machine agent −Manages the life-cycle of container −Container resource monitoring Application Master −Per-application −Manages application scheduling and task execution −E.g. MapReduce Application Master © Hortonworks Inc

Improvements vis-à-vis current MapReduce Scalability −Application life-cycle management is very expensive −Partition resource management and application life-cycle management −Application management is distributed −Hardware trends - Currently run clusters of 4,000 machines 6, machines > 12, machines v/s © Hortonworks Inc

Improvments vis-à-vis current MapReduce Fault Tolerance and Availability −Resource Manager No single point of failure – state saved in ZooKeeper Application Masters are restarted automatically on RM restart Applications continue to progress with existing resources during restart, new resources aren’t allocated −Application Master Optional failover via application-specific checkpoint MapReduce applications pick up where they left off via state saved in HDFS © Hortonworks Inc

Improvements vis-à-vis current MapReduce Wire Compatibility −Protocols are wire-compatible −Old clients can talk to new servers −Rolling upgrades © Hortonworks Inc

Improvements vis-à-vis current MapReduce Innovation and Agility −MapReduce now becomes a user-land library −Multiple versions of MapReduce can run in the same cluster (a la Apache Pig) Faster deployment cycles for improvements −Customers upgrade MapReduce versions on their schedule −Users can customize MapReduce e.g. HOP without affecting everyone! © Hortonworks Inc

Improvements vis-à-vis current MapReduce Utilization −Generic resource model Memory CPU Disk b/w Network b/w −Remove fixed partition of map and reduce slots © Hortonworks Inc

Improvements vis-à-vis current MapReduce Support for programming paradigms other than MapReduce −MPI −Master-Worker −Machine Learning −Iterative processing −Enabled by allowing use of paradigm-specific Application Master −Run all on the same Hadoop cluster © Hortonworks Inc

Summary MapReduce.Next takes Hadoop to the next level −Scale-out even further −High availability −Cluster Utilization −Support for paradigms other than MapReduce © Hortonworks Inc

Status – July, 2011 Feature complete Rigorous testing cycle underway −Scale testing at ~500 nodes Sort/Scan/Shuffle benchmarks GridMixV3! −Integration testing Pig integration complete! Coming in the next release of Apache Hadoop! Beta deployments of next release of Apache Hadoop at Yahoo! in Q4, 2011 © Hortonworks Inc

Questions? © Hortonworks Inc

© Hortonworks Inc. 2011