Presentation is loading. Please wait.

Presentation is loading. Please wait.

Map reduce with Hadoop streaming and/or Hadoop. Hadoop Job Hadoop Mapper Hadoop Reducer Partitioner Hadoop FileSystem Combiner Shuffle Sort Shuffle Sort.

Similar presentations


Presentation on theme: "Map reduce with Hadoop streaming and/or Hadoop. Hadoop Job Hadoop Mapper Hadoop Reducer Partitioner Hadoop FileSystem Combiner Shuffle Sort Shuffle Sort."— Presentation transcript:

1 Map reduce with Hadoop streaming and/or Hadoop

2 Hadoop Job Hadoop Mapper Hadoop Reducer Partitioner Hadoop FileSystem Combiner Shuffle Sort Shuffle Sort Java, Unix File System, … Streaming Hadoop Job Streaming Hadoop Job Hadoop FileSystem HS Mapper HS Reducer Adaptor … API: text lines in, lines out

3 Stream & Sort Unix file system SS Mapper SS Reducer cat input | mapper | sort | reducer > output API: text lines in, lines out mapper = java [-cp myJar.jar] MyMainClass1 reducer = java [-cp myJar.jar] MyMainClass2

4 Hadoop Job Hadoop Mapper Hadoop Reducer Partitioner Hadoop FileSystem Combiner Shuffle Sort Shuffle Sort Java, Unix File System, … Streaming Hadoop Job Streaming Hadoop Job Hadoop FileSystem HS Mapper HS Reducer Adaptor … API: text lines in, lines out mapper = java [-cp myJar.jar] MyMainClass1 reducer = java [-cp myJar.jar] MyMainClass2

5 Breaking this down… What actually is a key-value pair? How do you interface with Hadoop? One very simple way: Hadoop’s streaming interface. Mapper outputs key-value pairs as: One pair per line, key and value tab-separated Reduced reads in data in the same format Lines are sorted so lines with the same key are adjacent.

6 An example: SmallStreamNB.java and StreamSumReducer.java: the code you just wrote.

7 To run locally:

8 To train with streaming Hadoop: First, you need to prepare the corpus by splitting it into shards … and distributing the shards to different machines:

9

10 To train with streaming Hadoop: One way to shard text: hadoop fs -put LocalFileName HDFSName then run a streaming job with ‘cat’ as mapper and reducer and specify the number of shards you want with - numReduceTasks

11 To train with streaming Hadoop: Next, prepare your code for upload and distribution to the machines cluster

12 To train with streaming Hadoop: Next, prepare your code for upload and distribution to the machines cluster

13 Now you can run streaming Hadoop:

14 on EC2, what’s different? (aside from interface) s3://…./myjar.jar

15

16 “Real” Hadoop Streaming is simple but There’s no typechecking of inputs/outputs You need to parse strings a lot You can’t use compact binary encodings … basically you have limited control over what you’re doing

17 others: KeyValueInputFormat SequenceFileInputFormat

18

19


Download ppt "Map reduce with Hadoop streaming and/or Hadoop. Hadoop Job Hadoop Mapper Hadoop Reducer Partitioner Hadoop FileSystem Combiner Shuffle Sort Shuffle Sort."

Similar presentations


Ads by Google