Team3: Xiaokui Shu, Ron Cohen CS5604 at Virginia Tech December 6, 2010.

2  Introduction  Hadoop  MapReduce  Working With Hadoop  Environment  MapReduce Programming  Summary

3  Hadoop  Is a software framework  Users write their own programs on top of it  Like a super-library  For distributed applications  Built-in solutions  Solutions depend on this framework  Inspired by Google's MapReduce and Google File System (GFS) papers

4  Who uses Hadoop  Amazon ▪ Amazon's product search indices  Adobe ▪ 30 nodes running HDFS, Hadoop and HBase  Baidu ▪ handles about 3000 TB per week  Facebook ▪ stores copies of internal log and dimension data sources  LinkedIn, IBM, Yahoo!, Google…

5  Hadoop Common  HDFS  MapReduce  ZooKeeper

6  Connections to the IR book  Ch.4 Index construction ▪ Distributed indexing (4.4)  Ch.20 Web crawling and indexes ▪ Distributed crawler (20.2) ▪ Distributed indexing (20.3)

7  MapReduce  Is a software framework  For distributed computing  Massive amounts of data  Simple processing requirements  Portability across a variety of platforms ▪ Clusters ▪ CMP/SMP ▪ GPGPU  Introduced by Google

9  Map  Map(k1, v1) -> list(k2, v2)  Reduce  Reduce(k2, list(v2)) -> list(v3)  Hadoop MapReduce  (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
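The two signatures on this slide can be written down as generic Java interfaces. The sketch below is a single-process, in-memory model, not the Hadoop API: the names `SignatureSketch`, `MapFn`, `ReduceFn`, `run`, and `invertedIndex` are all hypothetical, and the inverted-index example is chosen only because the deck links MapReduce to distributed indexing in the IR book.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative in-memory model of the Map/Reduce signatures; not the Hadoop API. */
class SignatureSketch {

    // Map(k1, v1) -> list(k2, v2)
    interface MapFn<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    // Reduce(k2, list(v2)) -> list(v3)
    interface ReduceFn<K2, V2, V3> {
        List<V3> reduce(K2 key, List<V2> values);
    }

    // Applies map to every input pair, groups intermediate values by key,
    // then applies reduce to each group. Keys come out in sorted order.
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, List<V3>> run(
            List<Map.Entry<K1, V1>> input,
            MapFn<K1, V1, K2, V2> mapFn,
            ReduceFn<K2, V2, V3> reduceFn) {
        Map<K2, List<V2>> groups = new TreeMap<>();
        for (Map.Entry<K1, V1> in : input) {
            for (Map.Entry<K2, V2> kv : mapFn.map(in.getKey(), in.getValue())) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<K2, List<V3>> output = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> g : groups.entrySet()) {
            output.put(g.getKey(), reduceFn.reduce(g.getKey(), g.getValue()));
        }
        return output;
    }

    // Example job: map emits (term, docId); reduce returns the sorted,
    // de-duplicated list of documents containing the term.
    static Map<String, List<String>> invertedIndex(List<Map.Entry<String, String>> docs) {
        MapFn<String, String, String, String> mapFn = (docId, text) -> {
            List<Map.Entry<String, String>> out = new ArrayList<>();
            for (String term : text.toLowerCase().split("\\s+")) {
                if (!term.isEmpty()) {
                    out.add(new SimpleEntry<>(term, docId));
                }
            }
            return out;
        };
        ReduceFn<String, String, String> reduceFn =
                (term, docIds) -> new ArrayList<>(new TreeSet<>(docIds));
        return run(docs, mapFn, reduceFn);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> docs = new ArrayList<>();
        docs.add(new SimpleEntry<>("file01", "Hello World Bye World"));
        docs.add(new SimpleEntry<>("file02", "Hello Hadoop Goodbye Hadoop"));
        System.out.println(invertedIndex(docs));
    }
}
```

In a real cluster the grouping step is the shuffle between mappers and reducers. Here a `TreeMap` stands in for it so that the key order is deterministic.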

10  Source
$ cat file01
Hello World Bye World
$ cat file02
Hello Hadoop Goodbye Hadoop
$

11  Map Output  For File01 ▪ <Hello, 1> <World, 1> <Bye, 1> <World, 1>  For File02 ▪ <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

12  Reduce Output  <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
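The WordCount walk-through on the preceding slides can be traced in plain Java without a cluster. This is a minimal sketch that assumes the two input files from the Source slide are passed in as strings. The class name `WordCountSketch` and its methods are hypothetical, and the in-memory grouping only imitates what the framework does between map and reduce.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** In-memory trace of map -> group by key -> reduce for WordCount; not Hadoop code. */
class WordCountSketch {

    // Map phase: emit (word, 1) for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String contents) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(contents);
        while (tokenizer.hasMoreTokens()) {
            out.add(new SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Reduce phase: sum the partial counts collected for one word.
    static int reduce(String word, List<Integer> partialCounts) {
        int sum = 0;
        for (int count : partialCounts) {
            sum += count;
        }
        return sum;
    }

    // Driver: map each input, group values by key in sorted order, reduce each group.
    static Map<String, Integer> run(List<String> inputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String input : inputs) {
            for (Map.Entry<String, Integer> kv : map(input)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList(
                "Hello World Bye World",          // contents of file01
                "Hello Hadoop Goodbye Hadoop");   // contents of file02
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
        System.out.println(run(inputs));
    }
}
```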

13  More input  More mappers  Combiner Function after Map  More reducers  Partition Function before Reduce  Focus on Map & Reduce
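The combiner and partition functions named on this slide can be illustrated in a few lines of plain Java. This is a sketch under assumptions: `CombinePartitionSketch`, `combine`, and `partition` are hypothetical names, and the hash-modulo rule shown is one common way to assign keys to reducers rather than a statement about how any particular cluster is configured.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of a combiner and a hash partition function; not Hadoop code. */
class CombinePartitionSketch {

    // Combiner: pre-aggregate one mapper's words locally, so that fewer
    // (word, count) pairs have to be sent on to the reducers.
    static Map<String, Integer> combine(List<String> mapperWords) {
        Map<String, Integer> local = new TreeMap<>();
        for (String word : mapperWords) {
            local.merge(word, 1, Integer::sum);
        }
        return local;
    }

    // Partition function: decide which reducer receives a key.
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is
    // always in [0, numReducers) even when hashCode() is negative.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Words emitted by the mapper that read file02.
        Map<String, Integer> combined =
                combine(Arrays.asList("Hello", "Hadoop", "Goodbye", "Hadoop"));
        // prints {Goodbye=1, Hadoop=2, Hello=1}
        System.out.println(combined);
        for (String key : combined.keySet()) {
            System.out.println(key + " -> reducer " + partition(key, 2));
        }
    }
}
```

Because the partition depends only on the key, every occurrence of a word lands on the same reducer no matter which mapper emitted it, which is what allows reducers to run independently.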

14  Hadoop in Java (C++)  Runs in 3 modes  Local (Standalone) Mode  Pseudo-Distributed Mode  Fully-Distributed Mode  It is set up in Pseudo-Distributed Mode in our instance on the IBM cloud

15  Process 1. Start Hadoop service 2. Prepare input 3. Write your MapReduce program 4. Compile your program 5. Run your application with Hadoop

16  Start Hadoop service
$ bin/hadoop namenode -format
$ bin/
 Initialize filesystem
$ bin/hadoop fs -put localdir hinputdir
 You can also use -get, -rm, -cat with fs

17  Compile your program & create jar
$ javac -classpath ${HADOOP}-core.jar -d wordcount_classes WordCount.java
$ jar -cvf wordcount.jar -C wordcount_classes/ .
 Run your application with Hadoop
$ bin/hadoop jar wordcount.jar org.myorg.WordCount hinputdir houtputdir

18
void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(AsString(result));

Cited from Wikipedia

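19
// Requires: java.io.IOException, java.util.StringTokenizer,
//           org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Emits (word, 1) for every whitespace-separated token in the input line.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}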

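20
// Requires: java.io.IOException, java.util.Iterator,
//           org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  // Sums all partial counts received for one word and emits (word, total).
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}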

21  Configurations & Main class  Leave other work for the Hadoop MapReduce Framework

22  Hadoop  Introduction  Connections to the IR book  MapReduce  Overview  E.g. WordCount  Environment configuration  Writing your MapReduce application

23  Hadoop Project  MapReduce in Hadoop  MapReduce: Simplified Data Processing on Large Clusters (Communications of the ACM)  Hadoop Single-Node Setup  Who uses Hadoop


