Introduction to Hadoop Owen O’Malley Yahoo!, Grid Team Modified by R. Cook.


CCA – Oct 2008

Problem
How do you scale up applications?
– Run jobs processing 100s of terabytes of data
– Takes 11 days to read that much data on 1 computer
Need lots of cheap computers
– Fixes the speed problem (15 minutes on 1,000 computers), but…
– Introduces reliability problems: in large clusters, computers fail every day, and cluster size is not fixed
Need common infrastructure
– Must be efficient and reliable

Solution
Hadoop, an open-source Apache project
Hadoop Core includes:
– Hadoop Distributed File System – distributes data
– Map/Reduce – distributes applications
Written in Java
Runs on:
– Linux, Mac OS X, Windows, and Solaris
– Commodity hardware

Goals of HDFS
Very large distributed file system
– 10K nodes, 100 million files, 10 PB
Assumes commodity hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
Optimized for batch processing
– Data locations exposed so that computations can move to where the data resides
– Provides very high aggregate bandwidth
Runs in user space, on heterogeneous OSes

Distributed File System
Single namespace for the entire cluster
Data coherency
– Write-once-read-many access model
– Clients can only append to existing files
Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes
Intelligent client
– Client can find the location of blocks
– Client accesses data directly from the DataNode
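As a back-of-the-envelope check of the block model: a 400 MB file at the 128 MB default occupies three full blocks plus one partial block. A minimal sketch (the `BlockMath` helper below is illustrative, not part of the HDFS API):

```java
// Sketch: how a file is split into fixed-size blocks, as described above.
// Hypothetical helper class, not an HDFS API.
public class BlockMath {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // default block size from the slide

    // Number of blocks needed to hold a file of the given length (ceiling division).
    static long blockCount(long fileLength) {
        return (fileLength + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long fileLength = 400L * 1024 * 1024;       // a 400 MB file
        System.out.println(BlockMath.blockCount(fileLength)); // 3 full blocks + 1 partial = 4
    }
}
```

Each of those blocks is then replicated (3 copies by default) across DataNodes, so the file consumes roughly 3x its logical size in raw disk.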

HDFS Architecture
[Diagram: the Client sends a filename to the NameNode (1), receives block IDs and DataNode locations (2), then reads the data directly from the DataNodes (3); DataNodes report cluster membership to the NameNode, and a Secondary NameNode runs alongside it.]
NameNode: maps a file to a file-id and a list of blocks, and each block to its DataNodes
DataNode: maps a block-id to a physical location on disk
Secondary NameNode: periodic merge of the transaction log

NameNode Metadata
Metadata in memory
– The entire metadata is kept in main memory
– No demand paging of metadata
Types of metadata
– List of files
– List of blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
A transaction log
– Records file creations, file deletions, etc.

DataNode
A block server
– Stores data in the local file system (e.g. ext3)
– Stores metadata for each block (e.g. CRC)
– Serves data and metadata to clients
Block report
– Periodically sends a report of all existing blocks to the NameNode
Facilitates pipelining of data
– Forwards data to other specified DataNodes

Block Placement
Default is 3 replicas, but settable
Blocks are placed (writes are pipelined):
– First on the writer's node
– Then on a node in a different rack
– Then on a second node in that same remote rack
Clients read from the closest replica
If the replication for a block drops below the target, it is automatically re-replicated
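The placement rule above can be sketched in a few lines of Java. This is an illustration of the policy only; `Node` and `placeReplicas` are made-up names, not Hadoop classes, and the real placement code also weighs load and availability:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the default 3-replica placement policy:
// replica 1 on the writer's node, replica 2 on a node in a different rack,
// replica 3 on another node in that same remote rack.
public class PlacementSketch {
    record Node(String name, String rack) {}

    static List<Node> placeReplicas(Node writer, List<Node> cluster) {
        List<Node> replicas = new ArrayList<>();
        replicas.add(writer);                                  // replica 1: local node
        Node remote = cluster.stream()
            .filter(n -> !n.rack().equals(writer.rack()))
            .findFirst().orElseThrow();                        // replica 2: different rack
        replicas.add(remote);
        Node sameRemoteRack = cluster.stream()
            .filter(n -> n.rack().equals(remote.rack()) && !n.equals(remote))
            .findFirst().orElseThrow();                        // replica 3: same remote rack
        replicas.add(sameRemoteRack);
        return replicas;
    }
}
```

The design trade-off: one local copy keeps the write cheap, while two copies on a single remote rack survive the loss of either rack with only one cross-rack transfer in the write pipeline.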

Data Correctness
Data is checked with CRC32
File creation
– Client computes a checksum per 512 bytes
– DataNode stores the checksums
File access
– Client retrieves the data and checksums from the DataNode
– If validation fails, the client tries other replicas
Periodic validation
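Per-chunk checksumming is easy to sketch with the JDK's own `java.util.zip.CRC32`. The `ChunkChecksums` class below is an illustration of the scheme, not HDFS code: one CRC32 per 512-byte chunk, so a corrupted byte invalidates only its own chunk:

```java
import java.util.zip.CRC32;

// Sketch of per-chunk checksumming: one CRC32 per 512-byte chunk of a buffer,
// as a client would compute at file-creation time. Illustrative only.
public class ChunkChecksums {
    static final int CHUNK = 512;

    static long[] checksums(byte[] data) {
        int chunks = (data.length + CHUNK - 1) / CHUNK;   // last chunk may be short
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            crc.reset();
            int len = Math.min(CHUNK, data.length - i * CHUNK);
            crc.update(data, i * CHUNK, len);             // checksum one chunk
            sums[i] = crc.getValue();
        }
        return sums;
    }
}
```

On read, the client recomputes the same CRCs over the received chunks and compares; a mismatch on any chunk sends it to another replica.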

Map/Reduce
Map/Reduce is a programming model for efficient distributed computing
It works like a Unix pipeline:
– cat input | grep | sort | uniq -c | cat > output
– Input | Map | Shuffle & Sort | Reduce | Output
Efficiency comes from
– Streaming through data, reducing seeks
– Pipelining
A good fit for a lot of applications
– Log processing
– Web index building
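To make the pipeline analogy concrete, here is the same `grep | sort | uniq -c` shape as a single-process Java sketch (`PipelineSketch` is a made-up name, not a Hadoop class): the tokenizing loop plays map, a `TreeMap` stands in for shuffle & sort (grouping by key in sorted order), and `merge` plays reduce:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// The Unix-pipeline analogy in miniature: map emits (word, 1) pairs,
// the TreeMap groups and sorts them by key, and merge() sums each group.
// A single-process sketch, not the Hadoop runtime.
public class PipelineSketch {
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> grouped = new TreeMap<>();    // shuffle & sort by key
        for (String line : lines) {
            for (String word : line.split("\\s+")) {       // map: emit (word, 1)
                if (!word.isEmpty())
                    grouped.merge(word, 1, Integer::sum);  // reduce: sum the 1s
            }
        }
        return grouped;
    }
}
```

Hadoop runs exactly this shape, but with the map loop spread across thousands of machines and the grouping done by a distributed shuffle.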

Map/Reduce features
Java and C++ APIs
– In Java use Objects, while in C++ classes
Each task can process data sets larger than RAM
Automatic re-execution on failure
– In a large cluster, some nodes are always slow or flaky
– Framework re-executes failed tasks
Locality optimizations
– Map/Reduce queries HDFS for the locations of input data
– Map tasks are scheduled close to the inputs when possible

Map Operator Examples
(map sqrt ) yields the set ( )
(map sort (dog,3) (cat,5) (bat,6) (cat,7)) yields the ordered set (bat,6) (cat,7) (cat,5) (dog,3)
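The map operator applies one function independently to every element of a collection, which is what makes it trivially parallelizable. The sqrt example in modern Java (illustrative; `MapOperator` is a made-up name):

```java
import java.util.List;

// The "map" operator from the slide, expressed with Java streams:
// apply sqrt to each element, producing a list of the same length.
public class MapOperator {
    static List<Double> mapSqrt(List<Double> xs) {
        return xs.stream().map(Math::sqrt).toList();
    }
}
```

Because each element is transformed with no knowledge of the others, the framework is free to split the input and run the function on different machines.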

Map/Reduce Dataflow
[Diagram of the Map/Reduce dataflow]

Map/Reduce Example

public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);
  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  JobClient.runJob(conf);
}

Map/Reduce Example

public class WordCount {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);  // emits (word, 1) as (Text, IntWritable)
      } // while
    }
  }

Map/Reduce Example

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();  // add up the 1s emitted for this word
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}

How is Yahoo using Hadoop?
We started with building better applications
– Scale up web batch applications (search, ads, …)
– Factor out common code from existing systems, so new applications will be easier to write
– Manage the many clusters we have more easily
The mission now includes research support
– Build a huge data warehouse with many Yahoo! data sets
– Couple it with a huge compute cluster and programming models to make using the data easy
– Provide this as a service to our researchers
– We are seeing great results! Experiments can be run much more quickly in this environment

Commodity Hardware Cluster
[Photo of a commodity hardware cluster]

Hadoop clusters
We have ~20,000 machines running Hadoop
Our largest clusters are currently 2,000 nodes
Several petabytes of user data (compressed, unreplicated)
We run hundreds of thousands of jobs every month

Running Production WebMap
Search needs a graph of the “known” web
– Invert edges, compute link text, whole-graph heuristics
Periodic batch job using Map/Reduce
– Uses a chain of ~100 map/reduce jobs
Scale
– 1 trillion edges in the graph
– Largest shuffle is 450 TB
– Final output is 300 TB compressed
– Runs on 10,000 cores
– Raw disk used: 5 PB
Written mostly using Hadoop’s C++ interface

NY Times
Needed offline conversion of public-domain articles from
Used Hadoop to convert scanned images to PDF
Ran 100 Amazon EC2 instances for around 24 hours
4 TB of input
1.5 TB of output
[Example article image: published 1892, copyright New York Times]

Terabyte Sort Benchmark
Started by Jim Gray at Microsoft in 1998
Sorting 10 billion 100-byte records
Hadoop won the general category in 209 seconds
– 910 nodes
– 2 quad-core 2.0 GHz CPUs / node
– 4 SATA disks / node
– 8 GB RAM / node
– 1 Gb Ethernet / node
– 40 nodes / rack
– 8 Gb Ethernet uplink / rack
Previous record was 297 seconds
The only hard parts were:
– Getting a total order
– Converting the data generator to map/reduce
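"Getting a total order" means reducer i must receive only keys smaller than those of reducer i+1, so that concatenating the reducer outputs in order yields one globally sorted file. The usual approach is to sample the input for split points and route each key by range. A minimal sketch of that routing (hypothetical helper, not Hadoop's actual partitioner):

```java
import java.util.Arrays;

// Sketch of total-order partitioning: given sorted split points sampled from
// the input, route each key to the partition whose key range contains it.
// N-1 split points define N partitions.
public class TotalOrderSketch {
    static int partition(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns (-(insertionPoint) - 1) when the key is absent;
        // an exact match on split point i belongs to partition i + 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }
}
```

With well-sampled split points each reducer gets a similar volume of keys, which matters because the slowest reducer determines the sort's finish time.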

Hadoop at Facebook
Production cluster
– 4,800 cores, 600 machines, 16 GB RAM per machine – April 2009
– 8,000 cores, 1,000 machines, 32 GB RAM per machine – July 2009
– 4 SATA disks of 1 TB each per machine
– 2-level network hierarchy, 40 machines per rack
– Total cluster size is 2 PB, projected to be 12 PB in Q
Test cluster
– 800 cores, 16 GB each

What’s Next?
Better scheduling
– Pluggable scheduler
– Queues for controlling resource allocation between groups
Splitting Core into sub-projects
– HDFS, Map/Reduce, Hive
Total order sampler and partitioner
Table store library
HDFS and Map/Reduce security
High availability via ZooKeeper

Q&A
For more information:
– Website:
– Mailing lists:
– IRC: #hadoop on irc.freenode.org