Wordcount CSCE 587 Spring 2018.
Preliminary steps in the VM
First: log in to the sandbox in the VM
    URL: vm-hadoop-XX.cse.sc.edu:4200
    Where: XX is the VM number assigned to you
    Account: maria_dev
    Password: qwertyCSCE587
First thing: change your password!
    Ex: passwd
    You will be prompted for your current password,
    then prompted for a new password.
    BE SURE TO KEEP TRACK OF THIS PASSWORD!

Preliminary steps in the VM
1. Use wget to transfer the source file from the web:
    Ex: [student@sandbox ~]$ wget https://cse.sc.edu/~rose/587/greeneggsandham.txt
    (wget is a free utility for non-interactive download of files from the Web)
2. To save future typing, use a less verbose file name:
    Ex: [student@sandbox ~]$ mv greeneggsandham.txt g.txt

Preliminary steps in the VM
Transfer the file from the VM Linux filesystem to the Hadoop filesystem (HDFS):
    hadoop fs -put g.txt
    (Alternatively: hadoop fs -put g.txt /user/maria_dev)

Preliminary steps in the VM
Convince yourself by checking the HDFS:

[maria_dev@sandbox-hdp ~]$ hadoop fs -ls /user/maria_dev
Found 3 items
drwx------   - maria_dev hdfs    0 2018-04-06 06:00 /user/maria_dev/.Trash
drwx------   - maria_dev hdfs    0 2018-04-05 20:39 /user/maria_dev/.staging
-rw-r--r--   1 maria_dev hdfs 1746 2018-04-09 17:40 /user/maria_dev/g.txt

Our first mapreduce program
There are three components of our wordcount program:
    Map --------- emit a count of 1 for each word
    Reduce ------ aggregate the counts for each word
    The command line that puts it all together
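Before running on the cluster, the map → reduce flow can be simulated locally. A minimal Python sketch (independent of the mapper.py/reducer.py scripts on the following slides, using made-up sample lines) of the same two stages:

```python
from collections import defaultdict

def map_phase(lines):
    # Emit a (word, 1) pair for every word, like mapper.py
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Aggregate the counts per word, like reducer.py
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

text = ["I am Sam", "Sam I am"]
print(reduce_phase(map_phase(text)))   # {'I': 2, 'am': 2, 'Sam': 2}
```

On a real cluster, Hadoop's shuffle sits between the two phases and routes all pairs for one word to the same reducer; here the single dictionary plays that role.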

Map (in Python)

#!/usr/bin/env python
import sys

for line in sys.stdin:                    # Get input lines from stdin
    line = line.strip()                   # Remove whitespace from beginning and end of the line
    words = line.split()                  # Split it into words
    for word in words:                    # Output (word, 1) tuples on stdout
        print '%s\t%s' % (word, "1")
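The mapper's per-line logic can be checked without stdin at all. A Python 3 sketch (the script above is Python 2) on one hypothetical sample line:

```python
# Same transformation as the mapper's loop body, applied to one sample line
line = "  Sam I am  "
pairs = ['%s\t%s' % (word, "1") for word in line.strip().split()]
print(pairs)   # ['Sam\t1', 'I\t1', 'am\t1']
```

Each word becomes one tab-separated "word<TAB>1" record, which is exactly what Hadoop Streaming passes on to the reducer.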

Reduce (in Python)

#!/usr/bin/env python
import sys

wordcount = {}                            # Dictionary mapping words to counts
for line in sys.stdin:                    # Get input from stdin
    line = line.strip()                   # Remove whitespace from beginning and end of the line
    word, count = line.split('\t', 1)     # Parse the tab-separated input from mapper.py
    try:
        count = int(count)                # Convert count (currently a string) to int
    except ValueError:
        continue                          # Skip lines where count is not a number
    try:
        wordcount[word] = wordcount[word] + count
    except KeyError:                      # First time we have seen this word
        wordcount[word] = count
for word in wordcount.keys():             # Write the tuples to stdout
    print '%s\t%s' % (word, wordcount[word])  # Note: the tuples are unsorted
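The reducer's aggregation can likewise be tested locally by substituting a list of mapper-style lines for stdin. A Python 3 sketch (the script above is Python 2) with hypothetical sample data, emitting the counts in sorted order rather than dictionary order:

```python
# Local stand-in for sys.stdin: mapper-style "word<TAB>1" lines (sample data)
mapper_output = ["ham\t1", "green\t1", "eggs\t1", "ham\t1"]

wordcount = {}
for line in mapper_output:
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue                          # skip malformed lines
    wordcount[word] = wordcount.get(word, 0) + count

for word in sorted(wordcount):            # sorted, unlike the slide version
    print('%s\t%s' % (word, wordcount[word]))
```

Using dict.get with a default of 0 replaces the try/except around the dictionary lookup; both handle the first occurrence of a word.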

The Command Line

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
    -file ./mapper.py  -mapper 'python mapper.py' \
    -file ./reducer.py -reducer 'python reducer.py' \
    -input /user/maria_dev/g.txt -output /user/maria_dev/output

Note: you cannot overwrite existing files

# If "/user/maria_dev/output" already exists, the mapreduce job will fail and you will
# have to delete "output" first:
hadoop fs -rm -R /user/maria_dev/output

VM: Check for changes to HDFS

[maria_dev@sandbox-hdp ~]$ hadoop fs -ls /user/maria_dev
Found 4 items
drwx------   - maria_dev hdfs    0 2018-04-06 06:00 /user/maria_dev/.Trash
drwx------   - maria_dev hdfs    0 2018-04-09 18:06 /user/maria_dev/.staging
-rw-r--r--   1 maria_dev hdfs 1746 2018-04-09 17:40 /user/maria_dev/g.txt
drwxr-xr-x   - maria_dev hdfs    0 2018-04-09 18:06 /user/maria_dev/output

Fetch the results from HDFS

hadoop fs -cat output/part-00000
and 8
car, 1
Would 6
house. 4
not 31
tree. 2
car. 1
anywhere. 5
in 16
You 4
etc.
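Since the reducer leaves the tuples unsorted, a small post-processing step can rank them. A Python 3 sketch that parses "word<TAB>count" lines in the part-00000 format (sample values taken from the output above) and finds the most frequent words:

```python
# Parse "word<TAB>count" lines, as written by the reducer, and rank by count
lines = ["and\t8", "car,\t1", "Would\t6", "house.\t4", "not\t31",
         "tree.\t2", "car.\t1", "anywhere.\t5", "in\t16", "You\t4"]
pairs = [(word, int(count)) for word, count in (line.split('\t') for line in lines)]
top = sorted(pairs, key=lambda p: p[1], reverse=True)
print(top[:3])   # the three most frequent words: [('not', 31), ('in', 16), ('and', 8)]
```

Note that "car," and "car." are counted separately: the mapper splits only on whitespace and does not strip punctuation, which is visible in the output above.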