Stephen Tak-Lon Wu Indiana University Some material adapted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under Creative Commons Attribution 3.0 License)

 MapReduce Overview  Hadoop-WordCount Demo  Hadoop-Blast Demo  Q&A

 MapReduce – The Story of Tom ◦ By Saliya Ekanayake

 Map ◦ Apply the same function f to all the elements of a list
 Fold ◦ Moves across a list, applying function g to each element plus an accumulator
(Diagram: map applies f to each element independently; fold threads g across the list, carrying an accumulator from an initial value to the final result.)
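As an aside, both constructs exist in plain Java; a minimal sketch using java.util.stream (illustrative only, not the Hadoop API):

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapFoldDemo {
  public static void main(String[] args) {
    List<Integer> xs = Arrays.asList(1, 2, 3, 4);

    // Map: apply the same function f (here x -> x * x) to every element
    List<Integer> squares =
        xs.stream().map(x -> x * x).collect(Collectors.toList());

    // Fold: walk the list, combining each element with an accumulator
    // via g (here +), starting from the initial value 0
    int sum = xs.stream().reduce(0, (acc, x) -> acc + x);

    System.out.println(squares + " " + sum);  // [1, 4, 9, 16] 10
  }
}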

 Influenced by functional programming constructs
 Users implement an interface of two functions:
◦ map (in_key, in_value) -> (out_key, intermediate_value) list
◦ reduce (out_key, intermediate_value list) -> out_value list
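Instantiated for the word-count example used later in these slides, the two signatures read (a sketch in the same notation):

map (file offset, line of text) -> list of (word, 1)
reduce (word, [1, 1, ..., 1])   -> (word, total count)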

 A list of data elements is passed, one at a time, to map() functions, which transform each input element into an individual output element.
 A map() produces one or more intermediate (key, value) pairs from the input list.
(Diagram: input list of (k, v) pairs -> MAP -> intermediate output list of (k', v') pairs.)

 After the map phase finishes, intermediate values with the same output key are reduced into one or more final values.
(Diagram: intermediate map output of (k', v') pairs -> Reduce -> final results.)

 map() functions run in parallel, creating different intermediate values from different input data elements
 reduce() functions also run in parallel, each working on its assigned output keys
 All values are processed independently
 The reduce phase can't start until the map phase is completely finished

 Open source MapReduce framework implementation
 Widely used by ◦ Amazon Elastic MapReduce ◦ eBay ◦ Facebook ◦ TeraSort at Yahoo! ◦ …
 Active community
 Many subprojects ◦ HDFS ◦ Pig ◦ HBase ◦ …


 Automatic parallelization
 Data distribution
 Fault tolerance
 Status monitoring
 Simple programming constructs

 “Hello World” in MapReduce programming style
 Fits well with the MapReduce programming model
◦ counts the occurrences of each word in the given files
◦ each file (possibly split) is processed in parallel
◦ a reduce phase is needed to collect the final result
 Three main parts:
◦ Mapper
◦ Reducer
◦ Driver (a minimal driver sketch follows the Reducer code below)

(Diagram: word-count data flow, Input -> Mapping -> Shuffling -> Reducing. Input lines "foo car bar", "foo bar foo", "car car car" are mapped to (word, 1) pairs, shuffled so pairs with the same key reach the same reducer, and reduced to the counts foo, 3 / bar, 2 / car, 4.)

void map(key, value) {
  for each word x in value:
    output.collect(x, 1);
}

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

void reduce(keyword, <list of values>) {
  sum = 0;
  for each x in <list of values>:
    sum += x;
  output.collect(keyword, sum);
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;  // initialize the sum for each keyword
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
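The Driver, listed above as the third main part but not shown on the original slides, configures and submits the job. A minimal sketch assuming the Mapper and Reducer above live inside a class named WordCount (class and job names here are illustrative, using the Hadoop 0.20-era API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // The Map and Reduce classes from the previous slides go here

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(Map.class);              // the Mapper above
    job.setReducerClass(Reduce.class);          // the Reducer above
    job.setOutputKeyClass(Text.class);          // types of the final output
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}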

 qsub -I ◦ Get a FutureGrid worker node
 hostname ◦ e.g. i51
◦ The public hostname should be i51r.idp.iu.futuregrid.org

 cd Hadoop-WordCount
 cat WordCount.java
 cat build.sh
 ./build.sh
 ls -l

 cd ~/hadoop standalone/bin
 cp ~/Hadoop-WordCount/wordcount.jar ~/hadoop standalone/bin
 ./hadoop jar wordcount.jar WordCount ~/Hadoop-WordCount/input ~/Hadoop-WordCount/output
 Standalone mode simulates HDFS using the local directories
 cd ~/Hadoop-WordCount/
 cat output/part-r-00000

 The reduce phase cannot start until all map tasks finish ◦ a single extremely long map task leaves the rest of the cluster idle
 "Combiner" functions run on the same machine as a mapper
 They cause a mini-reduce phase to occur before the real reduce phase, to save bandwidth (see the sketch below)
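In Hadoop the combiner is registered in the driver. For WordCount the Reducer itself can serve as the combiner, because summing counts is associative and commutative; a one-line fragment against the driver sketch above:

// Run the WordCount Reducer as a local mini-reduce on each mapper's output
job.setCombinerClass(Reduce.class);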

(Diagram: word count with a combiner, Input -> Mapping -> Combiner -> Shuffling -> Reducing. Each mapper's (word, 1) pairs are locally pre-summed, e.g. "foo bar foo" becomes foo, 2 / bar, 1 and "car car car" becomes car, 3, before shuffling; the reducers still produce the same final counts foo, 3 / bar, 2 / car, 4.)

 Job  Jobtracker  Task  TaskTracker  Shard

 Based on the Google File System
 Runs on low-cost commodity hardware
 Fault tolerance ◦ Redundant storage
 High-throughput access
 Suitable for large data sets
 Large capacity
 Integration with MapReduce

 Single NameNode
 Multiple DataNodes
(Diagram: HDFS architecture with one NameNode and several DataNodes.)

 dfs.name.dir ◦ Directory on the NameNode where metadata is stored
 dfs.data.dir ◦ Directory on each DataNode where data blocks are stored ◦ Must be on a local disk partition

conf/hdfs-site.xml:
<property>
  <name>dfs.name.dir</name>
  <value>/tmp/hadoop-test/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/tmp/hadoop-test/data</value>
</property>

 fs.default.name ◦ URI of the NameNode
 conf/masters ◦ IP address of the master node
 conf/slaves ◦ IP addresses of the slave nodes
 Should have password-less SSH access to all the nodes

conf/core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://<namenode-host>:9000/</value>
</property>

 Starting HDFS
◦ cd ~/hadoop /bin
◦ ./hadoop namenode -format
◦ ./start-dfs.sh
 NameNode logs
◦ cd ~/hadoop
◦ logs/hadoop-<user>-namenode-<host>.log
 Monitoring web interface
◦ http://<namenode-host>:50070

 ./bin/hadoop fs -[command] ◦ put ◦ get ◦ ls ◦ cp ◦ mkdir ◦ rm
 rent/hdfs_shell.html
 Programmatic API (see the sketch below)
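A minimal sketch of the programmatic API mentioned above, using org.apache.hadoop.fs.FileSystem; the paths and file names here are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();    // picks up conf/core-site.xml
    FileSystem fs = FileSystem.get(conf);        // connects to fs.default.name

    fs.mkdirs(new Path("/user/demo"));           // like: hadoop fs -mkdir
    fs.copyFromLocalFile(new Path("local.txt"),  // like: hadoop fs -put
                         new Path("/user/demo/local.txt"));
    for (FileStatus s : fs.listStatus(new Path("/user/demo"))) {
      System.out.println(s.getPath());           // like: hadoop fs -ls
    }
  }
}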

 mapred.job.tracker
 mapred.local.dir
 mapred.tasktracker.map.tasks.maximum

conf/mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value><jobtracker-host>:9001</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/tmp/hadoop-test/local</value>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>

 Starting Hadoop MapReduce
◦ cd ~/hadoop /bin
◦ ./start-mapred.sh
 JobTracker logs
◦ cd ~/hadoop
◦ logs/hadoop-<user>-tasktracker-<host>.log
 Monitoring web interface
◦ http://<jobtracker-host>:50030

 Upload input to HDFS
◦ cd ~/hadoop /bin
◦ ./hadoop fs -put ~/Hadoop-WordCount/input/ input
◦ ./hadoop fs -ls input
 Run word count
◦ ./hadoop jar wordcount.jar WordCount input output
◦ ./hadoop fs -ls output
◦ ./hadoop fs -cat output/*

 A more advanced MapReduce application
 Uses BLAST (Basic Local Alignment Search Tool), a well-known bioinformatics application written in C/C++
 Utilizes the computing capability of Hadoop
 Uses the Distributed Cache for the BLAST program and database
 No reducer in practice

public void map(String key, String value, Context context)
    throws IOException, InterruptedException {
  ...
  // Download the input file from HDFS
  fs.copyToLocalFile(inputFilePath, new Path(localInputFile));
  // Prepare the arguments to the executable
  ...
  execCommand = this.localBlastProgram + File.separator + execName + " "
      + execCommand + " -db " + this.localDB;
  // Create the external process
  Process p = Runtime.getRuntime().exec(execCommand);
  ...
  p.waitFor();
  // Upload the results to HDFS
  fs.copyFromLocalFile(new Path(outFile), outputFileName);
  ...
} // end of overriding the map

...
// Using the distributed cache
DistributedCache.addCacheArchive(new URI(BlastProgramAndDB), jc);
...
// Other parts
...
// Input and output formats
job.setInputFormatClass(DataFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
…

 Make sure the HDFS and MapReduce daemons start correctly
 Put BLAST queries on HDFS
◦ cd ~/hadoop /bin
◦ ./hadoop fs -put ~/hadoop /apps/Hadoop-Blast/input HDFS_blast_input
◦ ./hadoop fs -ls HDFS_blast_input
 Copy the BLAST program and DB to HDFS
◦ ./hadoop fs -copyFromLocal $BLAST_HOME/BlastProgramAndDB.tar.gz BlastProgramAndDB.tar.gz
◦ ./hadoop fs -ls BlastProgramAndDB.tar.gz

 Run the Hadoop-Blast program
◦ cd ~/hadoop /bin
◦ ./hadoop jar ~/hadoop /apps/Hadoop-Blast/executable/blast-hadoop.jar BlastProgramAndDB.tar.gz bin/blastx /tmp/hadoop-test/ db nr HDFS_blast_input HDFS_blast_output '-query #_INPUTFILE_# -outfmt 6 -seg no -out #_OUTPUTFILE_#'
 Check the result
◦ cd ~/hadoop /bin
◦ ./hadoop fs -ls HDFS_blast_output
◦ ./hadoop fs -cat HDFS_blast_output/pre_1.fa

 More map tasks ◦ many more than the map task capacity of the cluster
 Number of reduce tasks ◦ about 95% of the cluster's reduce task capacity ◦ interleaves computation with communication (see the fragment below)
 Data locality
 Map/reduce task capacity of a worker
 Logs ◦ don't store them on NFS; they grow pretty fast
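A small fragment for the driver illustrating the reduce-count rule of thumb; the node and slot counts here are assumed, not from the original slides:

// Assume 4 worker nodes with 2 reduce slots each (illustrative numbers)
int nodes = 4, reduceSlotsPerNode = 2;
int reduceSlots = nodes * reduceSlotsPerNode;        // total reduce capacity
// ~95% of capacity lets all reduces run in a single wave with headroom
job.setNumReduceTasks((int) (reduceSlots * 0.95));   // 7 reduce tasks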

 Big Data for Science Tutorial
 Yahoo! Hadoop Tutorial
 Hadoop wiki
 Apache Hadoop Map/Reduce Tutorial
 Google: Cluster Computing and MapReduce

(Diagram: the word-count flow again with the framework-level stages made explicit: Input -> Mapping -> Shuffling -> Sorting -> Reducing, the map stage executing three times, once per input split. Shuffling and sorting happen at the framework level, so each reducer sees its keys in order and the results come out sorted: bar, 2 / car, 4 / foo, 3.)

 MasterNode handles failures
◦ detects worker failures
◦ reruns completed & in-progress map() tasks
◦ reruns in-progress reduce() tasks
◦ if particular input key/value pairs repeatedly cause crashes in the map phase, skips those values on re-execution
 Partial output

 The master program tries to schedule map() tasks on the same machine as the physical file data, or at least on the same rack
 Map task inputs are split into blocks with a default size of 64 MB

 High component failure rates ◦ Inexpensive commodity components fail all the time  “Modest” number of HUGE files ◦ Just a few million ◦ Each is 100MB or larger; multi-GB files typical  Files are write-once, read many times  Large streaming reads  High sustained throughput favored over low latency

 Files stored as blocks ◦ Blocks stored across cluster ◦ Default block size 64MB  Reliability through replication ◦ Default is 3  Single master (NameNode) to coordinate access, keep metadata ◦ Simple centralized management  No data caching ◦ Little benefit due to large data sets, streaming reads  Focused on distributed applications
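The block size and replication factor described above are controlled by the HDFS configuration; a sketch in the same style as the earlier config slides (property names from the Hadoop 1.x era, values showing the defaults):

conf/hdfs-site.xml:
<property>
  <name>dfs.block.size</name>
  <value>67108864</value>   <!-- 64 MB, the default block size -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>          <!-- default replication factor -->
</property>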

 Input formats  Output formats  Data types  Distributed Cache

 Defines how to break the inputs into pieces
 Provides RecordReader objects to read the files ◦ reads the data and converts it to (key, value) pairs
 It's possible to write custom input formats
 Hadoop built-in input formats
◦ TextInputFormat  treats each line as a value (the byte offset is the key)
◦ KeyValueInputFormat  treats each line as a key-value pair (tab delimited)
◦ SequenceFileInputFormat  Hadoop-specific binary representation

 Defines how the result (key, value) pairs are written into files
 Possible to write custom output formats
 Hadoop default OutputFormats
◦ TextOutputFormat  one tab-separated "key value" line per pair
◦ SequenceFileOutputFormat  Hadoop-specific binary representation

 Implement the Writable interface ◦ Text, IntWritable, LongWritable, FloatWritable, BooleanWritable, etc.
 We can write custom types which Hadoop will serialize/deserialize, as long as the following is implemented:

public interface Writable {
  void readFields(DataInput in) throws IOException;
  void write(DataOutput out) throws IOException;
}
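A sketch of a custom type implementing that interface; PointWritable is a hypothetical example, not part of Hadoop:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical custom type: a 2D point Hadoop can serialize/deserialize
public class PointWritable implements Writable {
  private int x;
  private int y;

  public PointWritable() {}  // no-arg constructor required for deserialization

  public PointWritable(int x, int y) { this.x = x; this.y = y; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(x);          // serialize fields in a fixed order
    out.writeInt(y);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    x = in.readInt();         // read fields back in the same order
    y = in.readInt();
  }
}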

 Distributes a given set of data to all the mappers
◦ common data files, libraries, etc.
◦ similar to a broadcast
◦ available on the local disk
◦ automatically extracts compressed archives
 First upload the files to HDFS
 DistributedCache.addCacheFile()
 DistributedCache.getLocalCacheFiles()
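A short sketch of that sequence, assuming a file dictionary.txt has already been uploaded to HDFS (the path and file name are illustrative):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class CacheDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "cache-demo");
    // Driver side: register a file that is already in HDFS
    DistributedCache.addCacheFile(new URI("/user/demo/dictionary.txt"),
                                  job.getConfiguration());
    // ... set mapper, input/output paths, then submit the job ...
  }
}

// Mapper side (e.g. in setup()): read the node-local copies
// Path[] cached =
//     DistributedCache.getLocalCacheFiles(context.getConfiguration());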