Wordcount CSCE 587 Spring 2018.


1 Wordcount CSCE 587 Spring 2018

2 Preliminary steps in the VM
First: log in to the sandbox in the VM.
URL: vm-hadoop-XX.cse.sc.edu:4200 (XX is the VM number assigned to you)
Account: maria_dev
Password: qwertyCSCE587
First thing: change your password!
Ex: passwd
You will be prompted for your current password, then for a new password.
BE SURE TO KEEP TRACK OF THIS PASSWORD!

3 Preliminary steps in the VM
Use wget to transfer a file from the web.
Ex: ~]$ wget g.txt
wget is a free utility for non-interactive download of files from the Web.
(Slide callouts: Source file, Destination file)

4 Preliminary steps in the VM
Transfer the file from the VM's Linux filesystem to the Hadoop filesystem:
hadoop fs -put g.txt
(Alternatively: hadoop fs -put g.txt /user/maria_dev)

5 Preliminary steps in the VM
Convince yourself by checking the HDFS:
~]$ hadoop fs -ls /user/maria_dev
Found 3 items
drwx maria_dev hdfs :00 /user/maria_dev/.Trash
drwx maria_dev hdfs :39 /user/maria_dev/.staging
-rw-r--r maria_dev hdfs :40 /user/maria_dev/g.txt

6 Our first mapreduce program
There are three components of our wordcount program:
Map: create a count of 1 for each word
Reduce: aggregate the counts for each word
The command line that puts it all together
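The first two components can be tried out locally before touching Hadoop. The sketch below is a minimal in-memory imitation of the map, sort, and reduce phases. The function names `map_words`, `reduce_counts`, and `run_local` are hypothetical and not part of the slides, and the `sorted` call is only a stand-in for the shuffle that Hadoop performs between the two phases.

```python
from collections import defaultdict


def map_words(lines):
    """Map: emit a (word, 1) pair for each word in each line."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_counts(pairs):
    """Reduce: aggregate the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_local(lines):
    """Chain map -> sort (stand-in for Hadoop's shuffle) -> reduce."""
    return reduce_counts(sorted(map_words(lines)))


if __name__ == "__main__":
    sample = ["I would not in a car", "not in a tree"]
    for word, count in sorted(run_local(sample).items()):
        print("%s\t%d" % (word, count))
```

The third component, the command line, is what replaces `run_local` on the cluster: it wires the mapper and reducer scripts to files in HDFS.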

7 Map (in Python)
#!/usr/bin/env python
import sys

for line in sys.stdin:                  # Get input lines from stdin
    line = line.strip()                 # Remove whitespace from both ends of the line
    words = line.split()                # Split it into words
    for word in words:                  # Output tuples on stdout
        print('%s\t%s' % (word, "1"))
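The mapper above splits on whitespace only, so punctuation stays attached to words; the results at the end of the deck list `car,` and `car.` as separate keys. A possible variation, not part of the original slides, is sketched below: it lowercases each token and strips leading and trailing ASCII punctuation before emitting it. The helper names `normalize` and `map_line` are hypothetical, and whether case folding is wanted depends on the assignment.

```python
import string


def normalize(token):
    """Lowercase a token and strip ASCII punctuation from both ends (assumed policy)."""
    return token.strip(string.punctuation).lower()


def map_line(line):
    """Return the normalized, non-empty words of one input line."""
    words = (normalize(token) for token in line.split())
    return [word for word in words if word]


if __name__ == "__main__":
    # In a streaming job, iterate over sys.stdin instead of this sample list.
    sample = ["Not in a car.", "not in a car, not in a tree."]
    for line in sample:
        for word in map_line(line):
            print("%s\t%s" % (word, 1))
```

With this policy, `car,` and `car.` both map to the key `car`, so the reducer would add their counts together.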

8 Reduce (in Python)
#!/usr/bin/env python
import sys

wordcount = {}                          # Create a dictionary to map words to counts
for line in sys.stdin:                  # Get input from stdin
    line = line.strip()                 # Remove whitespace from both ends of the line
    word, count = line.split('\t', 1)   # Parse the input from mapper.py
    try:                                # Convert count (currently a string) to int
        count = int(count)
    except ValueError:
        continue
    try:
        wordcount[word] = wordcount[word] + count
    except KeyError:                    # First occurrence of this word
        wordcount[word] = count

for word in wordcount.keys():           # Write the tuples to stdout
    print('%s\t%s' % (word, wordcount[word]))
# Currently tuples are unsorted
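Hadoop Streaming sorts the mapper output by key before handing it to the reducer, so all lines for a given word arrive consecutively. The dictionary in the reducer above does not depend on that ordering. An alternative that does is sketched here with `itertools.groupby`; it keeps only one word's running total in memory at a time. This is a different technique from the slide's reducer, and the function names `parse` and `reduce_sorted` are hypothetical.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs from tab-separated lines, skipping bad ones."""
    for line in lines:
        word, sep, count = line.strip().partition("\t")
        if not sep:
            continue  # no tab: not a mapper record
        try:
            yield word, int(count)
        except ValueError:
            continue  # count was not a number


def reduce_sorted(lines):
    """Sum counts per word, assuming lines are already sorted by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


if __name__ == "__main__":
    # Sample input in the order the shuffle would deliver it (sorted by key).
    # In a streaming job, pass sys.stdin instead of this list.
    sample = ["a\t1", "a\t1", "car\t1", "not\t1", "not\t1", "not\t1"]
    for word, count in reduce_sorted(sample):
        print("%s\t%d" % (word, count))
```

If the input were not sorted, the same word could be emitted more than once, so this variant is only suitable where the sort between map and reduce is guaranteed.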

9 The Command Line
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -mapper 'python /home/maria_dev/mapper.py' \
    -reducer 'python /home/maria_dev/reducer.py' \
    -input /user/maria_dev/g.txt \
    -output /user/maria_dev/output

10 Note: you cannot overwrite existing files
# If "/user/maria_dev/output" already exists, the mapreduce job will fail
# and you will have to delete "output":
# hadoop fs -rm -R /user/maria_dev/output

11 VM: Check for changes to HDFS
~]$ hadoop fs -ls /user/maria_dev
Found 4 items
drwx maria_dev hdfs :00 /user/maria_dev/.Trash
drwx maria_dev hdfs :06 /user/maria_dev/.staging
-rw-r--r maria_dev hdfs :40 /user/maria_dev/g.txt
drwxr-xr-x - maria_dev hdfs :06 /user/maria_dev/output

12 Fetch the results from HDFS
hadoop fs -cat output/part
and 8
car, 1
Would 6
house. 4
not 31
tree. 2
car. 1
anywhere. 5
in 16
You 4
etc.
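Because the reducer writes its tuples unsorted, the listing above is in no particular order. One way to rank the words after fetching the results is sketched below. It assumes the output lines are tab-separated word and count pairs, as written by the reducer, and the function name `top_words` is hypothetical. The sample lines reuse a few of the counts shown above.

```python
def top_words(lines, n=5):
    """Return the n most frequent (word, count) pairs from 'word<TAB>count' lines."""
    pairs = []
    for line in lines:
        word, sep, count = line.rstrip("\n").rpartition("\t")
        if sep and count.isdigit():
            pairs.append((word, int(count)))
    # Sort by descending count, then alphabetically to break ties.
    pairs.sort(key=lambda pair: (-pair[1], pair[0]))
    return pairs[:n]


if __name__ == "__main__":
    sample = ["and\t8", "car,\t1", "Would\t6", "house.\t4", "not\t31", "in\t16"]
    for word, count in top_words(sample, n=3):
        print("%s\t%d" % (word, count))
```

On the VM, the same function could be fed from `sys.stdin` with the `hadoop fs -cat` output piped into the script; that wiring is left out here to keep the sketch self-contained.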

