State of the Elephant: Hadoop yesterday, today, and tomorrow
Owen

Presentation transcript:


Ancient History
Back in 2005
–Hired by Yahoo to create new infrastructure for Search WebMap
–WebMap was a graph of the entire web:
  –100 billion nodes
  –1 trillion edges
  –300 TB compressed
  –Took weeks to create
–Started designing and implementing a C++ framework based on GFS and MapReduce

Ancient History
In 2006
–Prototype was starting to run!
–Decided to throw away Juggernaut and adopt Apache Hadoop.
  –Already open source
  –Running on 20 machines
  –Nice OO interfaces
–Enabled Hadoop as a Service for Yahoo
–Finally got WebMap on Hadoop in 2008

What is Hadoop?
A framework for storing and processing big data on lots of commodity machines.
–Up to 4,000 machines
–Up to 20 PB
High reliability done in software
–Automated failover for data and computation
Implemented in Java

What is Hadoop?
HDFS – Distributed File System
–Combines cluster’s local storage into a single namespace.
–All data is replicated to multiple machines.
–Provides locality information to clients
MapReduce
–Batch computation framework
–Jobs divided into tasks. Tasks re-executed on failure
–User code wrapped around a distributed sort
–Optimizes for data locality of input
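The phrase "user code wrapped around a distributed sort" can be illustrated with a small single-process sketch. It is not Hadoop's API: the function names (`map_words`, `reduce_counts`, `run_job`) and the byte-sum partitioner are assumptions made for illustration only. The user supplies just the map and reduce functions; the surrounding framework partitions, sorts, and groups the keys.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """User map function: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """User reduce function: sum all counts seen for one word."""
    return word, sum(counts)


def run_job(lines, num_reducers=2):
    """Toy framework: map, partition by key, sort each partition, reduce."""
    # Map phase: run the user's map function over every input record.
    pairs = [kv for line in lines for kv in map_words(line)]

    # Partition: route each key to exactly one reducer. A byte sum is used
    # here (an illustrative choice) so the routing is deterministic.
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[sum(key.encode()) % num_reducers].append((key, value))

    # Sort phase + reduce phase: sorting brings equal keys together, so the
    # user's reduce function sees all values for a key in one call.
    results = {}
    for part in partitions:
        part.sort(key=itemgetter(0))
        for key, group in groupby(part, key=itemgetter(0)):
            word, total = reduce_counts(key, (v for _, v in group))
            results[word] = total
    return results


counts = run_job(["the elephant", "the yellow elephant"])
```

In a real cluster each partition would be sorted and reduced on a different machine, and a failed task would simply be run again; the sketch omits both.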

Hadoop Usage at Yahoo
Yahoo! uses Hadoop a lot
43,000 computers in ~20 Hadoop clusters.
Clusters run as a shared service for yahoos.
Hundreds of users every month
More than 1 million jobs every month
Four categories of clusters: Development, Alpha, Research, & Production
Increased productivity and innovation

Open Source Spectrum
Closed Source
–MapR, Oracle
Open Releases
–Red Hat Kernels, CDH
Open Development
–Protocol Buffers
Open Governance
–Apache

287 Hadoop Contributors

Release History
59 Releases
Branches from the last 2.5 years:
–0.20.{0,1,2} – Stable, but old
–0.20.2xx.y – Current stable releases (Should be 1.x.y!)
– – Unstable
Upcoming branches
– – Release candidates being rolled (2.0.0??)

Today
Features in
–Security
–Multi-tenancy limits
–Performance improvements
Features in
–RPMs & Debs
–New metrics framework supported
–Improved handling of disk failures
Features in
–HBase support
–Experimental WebHDFS
–Support renewal of arbitrary tokens by MapReduce

Security
–Prior versions of Hadoop trusted the client about the user’s login
–Strong authentication using Kerberos (and Active Directory)
–Authenticates both the user and the server.
–MapReduce tasks run as the user
–Audit log provides accurate record of who read or wrote which data
Multi-tenancy limits
–Users do a *lot* of crazy things with Hadoop.
–Hadoop is an extremely effective, if unintentional, DoS attack vector
–If users aren’t given limits, they impact other users.
Performance Improvements
–Vastly improved Capacity Scheduler
–Improved MapReduce shuffle

Installation packages for popular operating systems
–Simplifies installation and upgrade
Metrics 2 framework
–Allows multiple plugins to receive data
Disk failure improvements
–Allow servers to continue when a drive fails
–Required for machines with more disks

Support for HBase
–Adds support for sync to HDFS
WebHDFS
–Experimental HTTP/REST interface to HDFS
–Allows read/write access
–Thin client supports other languages
Web Authentication
–SPNEGO plugin for Kerberos web-UI authentication
Add JobTracker support for renewing and cancelling non-HDFS delegation tokens
–HBase, MapReduce, and Oozie delegation tokens can be renewed
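Because WebHDFS is plain HTTP, a thin client in any language mostly needs to build the right URL. The sketch below assumes the URL layout documented for released WebHDFS versions (`/webhdfs/v1/<path>?op=...`); the experimental version described on this slide may have differed. The host name, port, and file path are hypothetical, and the helper `webhdfs_url` is not part of any Hadoop library. No request is sent.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS-style URL: http://host:port/webhdfs/v1/<path>?op=<OP>."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # 'op' always comes first; extra query parameters follow in the given order.
    query = urlencode({"op": op, **(params or {})})
    # quote() percent-encodes unsafe characters but leaves '/' intact.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical NameNode address and file path, for illustration only.
read_url = webhdfs_url("namenode.example.com", 50070,
                       "/project/foo/part-00000", "OPEN")
list_url = webhdfs_url("namenode.example.com", 50070,
                       "/project/foo", "LISTSTATUS", {"user.name": "owen"})
```

A client would issue an HTTP GET against `read_url` to read the file; on a secure cluster the request would also need to authenticate, for example through SPNEGO.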

Tomorrow –
Timeline
–First alpha versions in January
–Final version in mid-2012
MapReduce V2 (aka YARN)
Federation
Performance improvements
MapReduce libraries ported to new API

MapReduce v2 (aka YARN)
Separate cluster compute resource allocation from MapReduce
MapReduce becomes a client-library
–Increased innovation
–Can run many versions of MapReduce on the same cluster
–Users can pick when they want to upgrade MapReduce
Supports non-MapReduce compute paradigms
–Graph processing
  –Giraph
–Iterative processing
  –Hama
  –Mahout
  –Spark

Architecture

Advantages of MapReduce v2
Persistent store in ZooKeeper
–Working toward HA
Generic resource model
–Currently based on RAM
Scales further
–Much simpler state
–Faster heartbeat response time
Wire protocols managed with Protocol Buffers
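A RAM-based resource model can be pictured as follows. This is a toy first-fit allocator written for illustration, not YARN's scheduler or API: the `Node` class, the `allocate` function, and the node and application names are all assumptions. It only shows that the allocator tracks memory and knows nothing about MapReduce, so any framework can request a container.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One worker machine, tracked only by its free memory in MB."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


def allocate(nodes, app, request_mb):
    """Grant a container on the first node with enough free RAM, else None."""
    for node in nodes:
        if node.free_mb >= request_mb:
            node.free_mb -= request_mb
            node.containers.append((app, request_mb))
            return node.name
    return None


# Hypothetical two-node cluster shared by three different frameworks.
cluster = [Node("node1", 2048), Node("node2", 4096)]
first = allocate(cluster, "mapreduce-job", 1536)   # fits on node1
second = allocate(cluster, "giraph-job", 1024)     # node1 has 512 MB left
third = allocate(cluster, "spark-job", 4096)       # no node has 4096 MB free
```

A production scheduler would add queues, capacity limits, and locality preferences on top of this basic bookkeeping.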

Federation
HDFS scalability limited by RAM for NameNode
–Entire namespace is stored in memory
Scale out by partitioning the namespace between NameNodes
–Each manages a directory sub-tree
Allow HDFS to share DataNodes between NameNodes
–Permits sharing of raw storage between NameNodes
Working on separating out the block pool layer
Support clients using client side mount table
–/project/foo -> hdfs://namenode2/foo
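The client-side mount table can be sketched as a longest-prefix lookup. Only the `/project/foo -> hdfs://namenode2/foo` entry comes from the slide; the `/user` entry, the `resolve` function, and the matching rules are assumptions for illustration, not Hadoop's implementation.

```python
def resolve(mount_table, path):
    """Map a client path to a NameNode URI using the longest matching mount."""
    best = None
    for prefix in mount_table:
        # Match whole path components only: /project/foo must not
        # capture /project/foobar.
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no mount point covers {path}")
    # Replace the mount prefix with its target and keep the rest of the path.
    return mount_table[best] + path[len(best):]


mount_table = {
    "/project/foo": "hdfs://namenode2/foo",   # from the slide
    "/user": "hdfs://namenode1/user",         # hypothetical second mount
}

target = resolve(mount_table, "/project/foo/data/part-0")
```

The client sees a single namespace, while each NameNode holds only its own sub-tree in memory.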

And Beyond…
High Availability
–Question: How often has Yahoo had a NameNode’s hardware crash?
–Answer: Once
–Question: How much data was lost in that crash?
–Answer: None
–Automatic failover only minimizes downtime
Wire Compatibility
–Use Protocol Buffers for RPC
–Enable communication between different versions of client and server
–First step toward supporting rolling upgrades

But wait, there is more
Hadoop is just one layer of the stack
–Updatable tables – HBase
–Coordination – ZooKeeper
–Higher level languages – Pig and Hive
–Graph processing – Giraph
–Serialization – Protocol Buffers, Thrift and Avro
How do you get all of the software installed and configured?
–Apache Ambari
–Controlled using CLI, Web UI, or REST
–Manages clusters as a stack of components working together
–Simplifies deploying and configuring Hadoop clusters
–Lets you check on the current state of the servers

HCatalog (aka HCat)
Manages metadata for table storage
–Based on Hive’s metadata server
–Uses Hive language for metadata manipulation operations
Provides access to tables from Pig, MapReduce, and Hive
Tables may be stored in RCFile, Text files, or SequenceFiles

Questions?
Thank you!
–My is
–Planning discussions occur on development lists