Wordcount CSCE 587 Spring 2018
Preliminary steps in the VM
First: log in to the sandbox in the VM
URL: vm-hadoop-XX.cse.sc.edu:4200, where XX is the VM number assigned to you
Account: maria_dev
Password: qwertyCSCE587
First thing: change your password! Ex: passwd
You will be prompted for your current password, then for a new password.
BE SURE TO KEEP TRACK OF THIS PASSWORD!
Preliminary steps in the VM
Use wget to transfer a file from the web. Ex:
[maria_dev@sandbox-hdp ~]$ wget https://cse.sc.edu/~rose/587/greeneggsandham.txt -O g.txt
wget is a free utility for non-interactive download of files from the Web. The URL is the source file; the -O flag names the destination file (here, g.txt).
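As a quick sanity check before going any further, you can confirm the download from the same shell (standard Linux commands, available in the sandbox):
[maria_dev@sandbox-hdp ~]$ ls -l g.txt
[maria_dev@sandbox-hdp ~]$ head g.txt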
Preliminary steps in the VM
Transfer the file from the VM Linux filesystem to the Hadoop filesystem:
hadoop fs -put g.txt
(Alternatively: hadoop fs -put g.txt /user/maria_dev)
Preliminary steps in the VM
Convince yourself by checking the HDFS:
[maria_dev@sandbox-hdp ~]$ hadoop fs -ls /user/maria_dev
Found 3 items
drwx------   - maria_dev hdfs       0 2018-04-06 06:00 /user/maria_dev/.Trash
drwx------   - maria_dev hdfs       0 2018-04-05 20:39 /user/maria_dev/.staging
-rw-r--r--   1 maria_dev hdfs    1746 2018-04-09 17:40 /user/maria_dev/g.txt
Our first MapReduce program
There are three components of our wordcount program:
Map ----------- emit a count of 1 for each word
Reduce ------- aggregate the counts for each word
The command line that puts it all together
Map (in Python)
#!/usr/bin/env python
import sys

for line in sys.stdin:              # Get input lines from stdin
    line = line.strip()             # Remove whitespace from beginning and end of the line
    words = line.split()            # Split it into words
    for word in words:              # Output tuples on stdout
        print '%s\t%s' % (word, "1")
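You can test the mapper by itself before involving Hadoop. From the shell, pipe some text into it (the input string is just an example; this assumes you saved the script as mapper.py in your home directory):
[maria_dev@sandbox-hdp ~]$ echo "I am Sam Sam I am" | python mapper.py
I	1
am	1
Sam	1
Sam	1
I	1
am	1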
Reduce (in Python)
#!/usr/bin/env python
import sys

wordcount = {}                      # Create a dictionary to map words to counts

for line in sys.stdin:              # Get input from stdin
    line = line.strip()             # Remove whitespace from beginning and end of the line
    word, count = line.split('\t', 1)   # Parse the input from mapper.py
    try:
        count = int(count)          # Convert count (currently a string) to int
    except ValueError:
        continue                    # Skip lines where count is not a number
    try:
        wordcount[word] = wordcount[word] + count
    except KeyError:                # First time we have seen this word
        wordcount[word] = count

for word in wordcount.keys():       # Write the tuples to stdout
    print '%s\t%s' % (word, wordcount[word])   # Currently tuples are unsorted
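Before submitting the job to Hadoop, you can simulate the whole pipeline locally; the sort step stands in for the shuffle/sort phase that Hadoop Streaming performs between map and reduce (this particular reducer aggregates in a dictionary, so it would work even without sort, but streaming reducers in general rely on sorted input). Again this assumes mapper.py and reducer.py are in your home directory:
[maria_dev@sandbox-hdp ~]$ cat g.txt | python mapper.py | sort | python reducer.py
If you want the final tuples in alphabetical order, iterate over sorted(wordcount.keys()) in the last loop of reducer.py.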
The Command Line
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -mapper 'python /home/maria_dev/mapper.py' \
  -reducer 'python /home/maria_dev/reducer.py' \
  -input /user/maria_dev/g.txt \
  -output /user/maria_dev/output
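Note: this works on the single-node sandbox because the mapper and reducer paths refer to the local filesystem of the same machine that runs the tasks. On a multi-node cluster you would ship the scripts with the streaming -file option; a sketch, assuming the scripts have been made executable (chmod +x mapper.py reducer.py) so their #!/usr/bin/env python line is honored:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -file /home/maria_dev/mapper.py -mapper mapper.py \
  -file /home/maria_dev/reducer.py -reducer reducer.py \
  -input /user/maria_dev/g.txt -output /user/maria_dev/output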
Note: you cannot overwrite existing files
# If "/user/maria_dev/output" already exists, then the mapreduce job will fail and you will
# have to delete "output":
# hadoop fs -rm -R /user/maria_dev/output
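Deleting with -rm -R moves the directory into HDFS trash (the .Trash directory seen in the listings above). To bypass trash entirely, hadoop fs -rm supports a -skipTrash flag:
# hadoop fs -rm -R -skipTrash /user/maria_dev/output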
VM: Check for changes to HDFS
[maria_dev@sandbox-hdp ~]$ hadoop fs -ls /user/maria_dev
Found 4 items
drwx------   - maria_dev hdfs       0 2018-04-06 06:00 /user/maria_dev/.Trash
drwx------   - maria_dev hdfs       0 2018-04-09 18:06 /user/maria_dev/.staging
-rw-r--r--   1 maria_dev hdfs    1746 2018-04-09 17:40 /user/maria_dev/g.txt
drwxr-xr-x   - maria_dev hdfs       0 2018-04-09 18:06 /user/maria_dev/output
Fetch the results from HDFS
hadoop fs -cat output/part-00000
and	8
car,	1
Would	6
house.	4
not	31
tree.	2
car.	1
anywhere.	5
in	16
You	4
etc.
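The output is unsorted, and punctuation stays attached to words because the mapper splits only on whitespace. Two optional follow-ups from the shell (the file name results.txt is just an example):
hadoop fs -cat output/part-00000 | sort -rn -k2 | head    # ten most frequent words
hadoop fs -get output/part-00000 results.txt              # copy the results to the local filesystem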