Team3: Xiaokui Shu, Ron Cohen CS5604 at Virginia Tech December 6, 2010
Introduction Hadoop MapReduce Working With Hadoop Environment MapReduce Programming Summary
Hadoop is a software framework, not a finished application: the user still writes the program. It works like a super-library for distributed applications. Built-in solutions are provided, and user solutions depend on this framework. Inspired by Google's MapReduce and Google File System (GFS) papers.
Who uses Hadoop?
A9.com (Amazon) ▪ Amazon's product search indices
Adobe ▪ 30 nodes running HDFS, Hadoop and HBase
Baidu ▪ handles about 3000TB per week
Facebook ▪ stores copies of internal log and dimension data sources
Last.fm, LinkedIn, IBM, Yahoo!, Google…
Hadoop subprojects: Hadoop Common, HDFS, MapReduce, ZooKeeper
Connections to the IR book Ch.4 Index construction ▪ Distributed indexing (4.4) Ch.20 Web crawling and indexes ▪ Distributed crawler (20.2) ▪ Distributed indexing (20.3)
MapReduce is a software framework for distributed computing: massive amounts of data with simple processing requirements. Portability across a variety of platforms ▪ Clusters ▪ CMP/SMP ▪ GPGPU. Introduced by Google.
Cited from MapReduce: Simplified Data Processing on Large Clusters
Map: map(k1, v1) -> list(k2, v2)
Reduce: reduce(k2, list(v2)) -> list(v3)
Hadoop MapReduce data flow:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
Source $cat file01 Hello World Bye World $cat file02 Hello Hadoop Goodbye Hadoop $
Map Output
For file01: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
For file02: <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
Reduce Output
<Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
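The WordCount flow above can be sketched in plain Java with no Hadoop dependencies. This is only an illustrative simulation (class and helper names are ours, not Hadoop's): map emits a <word, 1> pair per token, a shuffle step groups values by key the way the framework would, and reduce sums each group.

```java
import java.util.*;

public class WordCountSim {
    // map: document contents -> list of <word, 1> pairs
    static List<Map.Entry<String, Integer>> map(String contents) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : contents.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return out;
    }

    // shuffle: group intermediate values by key, as the framework does
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // reduce: <word, [counts]> -> <word, sum>
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.addAll(map("Hello World Bye World"));        // file01
        pairs.addAll(map("Hello Hadoop Goodbye Hadoop"));  // file02
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
        System.out.println(reduce(shuffle(pairs)));
    }
}
```

Note the grouping step is exactly what Hadoop's shuffle/sort phase does between map and reduce; the programmer writes only map and reduce.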
Scaling the job:
More input -> more mappers; a Combiner function runs after Map
More reducers -> a Partition function runs before Reduce
The programmer focuses on Map & Reduce
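The partition function decides which reducer each intermediate key is sent to. Hadoop's default HashPartitioner computes essentially the expression below; this plain-Java sketch (class name is ours) shows the key properties: the result is non-negative, stable for a given key, and spread across reducers.

```java
public class HashPartitionDemo {
    // Mimics Hadoop's default HashPartitioner: a given key always
    // lands on the same reducer, and keys spread across reducers.
    static int getPartition(String key, int numReducers) {
        // mask off the sign bit so the result is non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        for (String key : new String[] {"Hello", "World", "Hadoop"}) {
            System.out.println(key + " -> reducer " + getPartition(key, numReducers));
        }
    }
}
```

Because partitioning is by key, all <k2, v2> pairs with the same k2 arrive at the same reducer, which is what makes the reduce-side grouping correct.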
Hadoop is written in Java (C++ is also supported). It runs in 3 modes:
Local (Standalone) Mode
Pseudo-Distributed Mode
Fully-Distributed Mode
It is set up in Pseudo-Distributed Mode in our instance on the IBM cloud.
Process 1. Start Hadoop service 2. Prepare input 3. Write your MapReduce program 4. Compile your program 5. Run your application with Hadoop
Start Hadoop service
$ bin/hadoop namenode -format   (initialize the filesystem)
$ bin/start-all.sh
Prepare input
$ bin/hadoop fs -put localdir hinputdir
You can also use -get, -rm, -cat with fs
Compile your program & create a jar
$ javac -classpath ${HADOOP}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf wordcount.jar -C wordcount_classes/ .
Run your application with Hadoop
$ bin/hadoop jar wordcount.jar org.myorg.WordCount hinputdir houtputdir
void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(AsString(result));

Cited from Wikipedia
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Configuration & main class: write the job configuration and the main class, and leave the rest of the work to the Hadoop MapReduce framework.
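For the old (0.20-era) org.apache.hadoop.mapred API used above, the configuration and main class look roughly like this sketch. It is a job-configuration fragment, not standalone code: it assumes the Map and Reduce classes from the previous slides are declared as static nested classes of this same WordCount class, and it needs the Hadoop jars and a running cluster to execute.

```java
// Sketch of the WordCount driver for the old mapred API.
// Assumes Map and Reduce (previous slides) are nested in this class.
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);          // k2 type
        conf.setOutputValueClass(IntWritable.class); // v2 type

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);  // combiner can reuse the reducer
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // hinputdir
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // houtputdir
        JobClient.runJob(conf);  // submit the job and wait for completion
    }
}
```

Reusing the reducer as the combiner works here only because summing counts is associative and commutative; not every reducer can double as a combiner.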
Hadoop Introduction Connections to the IR book MapReduce Overview E.g. WordCount Environment configuration Writing your MapReduce application
References
Hadoop Project
MapReduce in Hadoop
MapReduce: Simplified Data Processing on Large Clusters (Communications of the ACM)
Hadoop Single-Node Setup
Who uses Hadoop