Wordcount CSCE 587 Spring 2018.


Preliminary steps in the VM
First, log in to the sandbox in the VM.
URL: vm-hadoop-XX.cse.sc.edu:4200 (XX is the VM number assigned to you)
Account: maria_dev
Password: qwertyCSCE587
The first thing to do is change your password. Ex:
    passwd
You will be prompted for your current password, then for a new password.
BE SURE TO KEEP TRACK OF THIS PASSWORD!

Preliminary steps in the VM
Use wget to transfer a file from the web. wget is a free utility for non-interactive download of files from the Web.
Ex:
    [student@sandbox ~]$ wget -O g.txt https://cse.sc.edu/~rose/587/greeneggsandham.txt
Source file: https://cse.sc.edu/~rose/587/greeneggsandham.txt
Destination file: g.txt (named with the -O option)

Preliminary steps in the VM
Transfer the file from the VM's Linux filesystem to the Hadoop filesystem (HDFS):
    hadoop fs -put g.txt
Alternatively, name the destination directory explicitly:
    hadoop fs -put g.txt /user/maria_dev

Preliminary steps in the VM
Convince yourself by checking the HDFS:
    [maria_dev@sandbox-hdp ~]$ hadoop fs -ls /user/maria_dev
    Found 3 items
    drwx------   - maria_dev hdfs          0 2018-04-06 06:00 /user/maria_dev/.Trash
    drwx------   - maria_dev hdfs          0 2018-04-05 20:39 /user/maria_dev/.staging
    -rw-r--r--   1 maria_dev hdfs       1746 2018-04-09 17:40 /user/maria_dev/g.txt

Our first MapReduce program
There are three components of our wordcount program:
    Map: create a count of 1 for each word
    Reduce: aggregate the counts for each word
    The command line that puts it all together
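Before moving to Hadoop, the Map and Reduce components can be tried in plain Python on a few lines of text. The following is only a minimal local sketch of the idea, not the course's mapper.py or reducer.py; the function names map_words and reduce_counts and the sample lines are made up for illustration.

```python
from collections import defaultdict


def map_words(lines):
    """Map step: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Hypothetical sample input, not the contents of g.txt.
sample_lines = ["I do not like them", "I do not like green eggs"]
counts = reduce_counts(map_words(sample_lines))

for word in sorted(counts):
    print("%s\t%s" % (word, counts[word]))
```

On a cluster the two steps run as separate processes and Hadoop moves the pairs between them, which is what the third component, the command line, arranges.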

Map (in Python)
    #!/usr/bin/env python
    import sys

    for line in sys.stdin:            # Get input lines from stdin
        line = line.strip()           # Remove whitespace from both ends of the line
        words = line.split()          # Split it into words
        for word in words:            # Output tab-separated (word, 1) tuples on stdout
            print('%s\t%s' % (word, "1"))

Reduce (in Python)
    #!/usr/bin/env python
    import sys

    wordcount = {}                          # Dictionary mapping words to counts

    for line in sys.stdin:                  # Get input from stdin
        line = line.strip()                 # Remove whitespace from both ends of the line
        word, count = line.split('\t', 1)   # Parse the input from mapper.py
        try:                                # Convert count (currently a string) to int
            count = int(count)
        except ValueError:                  # Skip lines whose count is not a number
            continue
        try:
            wordcount[word] = wordcount[word] + count
        except KeyError:                    # First time this word is seen
            wordcount[word] = count

    for word in wordcount.keys():           # Write the tuples to stdout
        print('%s\t%s' % (word, wordcount[word]))
    # Currently tuples are unsorted
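The reducer above keeps every word in a dictionary until all input has been read. Hadoop streaming delivers the mapper output to the reducer sorted by key, so a reducer can instead total one word at a time. The sketch below is an alternative under that assumption, not the version used in these slides; parse_pairs and reduce_sorted are hypothetical names, and the sample lines stand in for what would arrive on sys.stdin.

```python
from itertools import groupby


def parse_pairs(lines):
    """Turn 'word<TAB>count' lines into (word, int) pairs, skipping bad lines."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # no tab on this line, so it is not mapper output
        try:
            yield word, int(count)
        except ValueError:
            continue  # count was not a number


def reduce_sorted(lines):
    """Sum counts per word, assuming equal words arrive on adjacent lines."""
    for word, group in groupby(parse_pairs(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# Hypothetical, already-sorted mapper output; in a streaming job this
# would be sys.stdin instead of a list.
sample = ["a\t1\n", "a\t1\n", "b\t1\n", "bad line\n", "c\tx\n"]
for word, total in reduce_sorted(sample):
    print("%s\t%s" % (word, total))
```

This version only holds one word's running total in memory, but it gives wrong totals if the input is not grouped by word, for example when testing locally without a sort step between mapper and reducer.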

The Command Line
    hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
        -mapper 'python /home/maria_dev/mapper.py' \
        -reducer 'python /home/maria_dev/reducer.py' \
        -input /user/maria_dev/g.txt \
        -output /user/maria_dev/output

Note: you cannot overwrite existing files
If "/user/maria_dev/output" already exists, the MapReduce job will fail and you will have to delete "output" first:
    hadoop fs -rm -R /user/maria_dev/output
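When rerunning the job often, the delete-then-run sequence can be wrapped in a small script. This is a rough sketch only: the helper names are hypothetical, the paths are the ones used in these slides, and it assumes the hadoop command is on the PATH. The -f flag of hadoop fs -rm suppresses the error when the directory does not exist. Note that the cleanup permanently discards earlier results.

```python
import subprocess

# Paths taken from the slides; adjust them for your own VM.
STREAMING_JAR = "/usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar"


def build_cleanup_command(output_dir):
    """Argument list that removes an old output directory if it exists."""
    return ["hadoop", "fs", "-rm", "-r", "-f", output_dir]


def build_streaming_command(mapper, reducer, input_path, output_dir):
    """Argument list for the Hadoop streaming job shown in the slides."""
    return [
        "hadoop", "jar", STREAMING_JAR,
        "-mapper", "python " + mapper,
        "-reducer", "python " + reducer,
        "-input", input_path,
        "-output", output_dir,
    ]


def run_wordcount(output_dir="/user/maria_dev/output"):
    """Delete stale output, then launch the job (needs a working Hadoop install)."""
    subprocess.check_call(build_cleanup_command(output_dir))
    subprocess.check_call(build_streaming_command(
        "/home/maria_dev/mapper.py",
        "/home/maria_dev/reducer.py",
        "/user/maria_dev/g.txt",
        output_dir,
    ))


# Show the commands without running them.
print(" ".join(build_cleanup_command("/user/maria_dev/output")))
```

Calling run_wordcount() on the VM would perform both steps; the script as written only prints the cleanup command.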

VM: Check for changes to HDFS
    [maria_dev@sandbox-hdp ~]$ hadoop fs -ls /user/maria_dev
    Found 4 items
    drwx------   - maria_dev hdfs          0 2018-04-06 06:00 /user/maria_dev/.Trash
    drwx------   - maria_dev hdfs          0 2018-04-09 18:06 /user/maria_dev/.staging
    -rw-r--r--   1 maria_dev hdfs       1746 2018-04-09 17:40 /user/maria_dev/g.txt
    drwxr-xr-x   - maria_dev hdfs          0 2018-04-09 18:06 /user/maria_dev/output

Fetch the results from HDFS
    hadoop fs -cat output/part-00000
    and         8
    car,        1
    Would       6
    house.      4
    not         31
    tree.       2
    car.        1
    anywhere.   5
    in          16
    You         4
    etc.
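The output shows two side effects of splitting on whitespace only: punctuation stays attached to words, so "car," and "car." are counted separately, and the tuples come out unsorted. One possible post-processing step, sketched below, strips surrounding punctuation, lowercases each word, merges the counts, and sorts by count. The function names are hypothetical, and lowercasing is a design choice rather than something the slides prescribe. The sample pairs are the ones visible in the output above.

```python
import string
from collections import Counter


def normalize(word):
    """Strip leading/trailing punctuation and lowercase the word."""
    return word.strip(string.punctuation).lower()


def merge_and_sort(pairs):
    """Merge counts of words that normalize to the same form, largest first."""
    totals = Counter()
    for word, count in pairs:
        key = normalize(word)
        if key:  # ignore tokens that were punctuation only
            totals[key] += count
    # Sort by descending count, then alphabetically for ties.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# The (word, count) pairs shown on the slide.
slide_pairs = [("and", 8), ("car,", 1), ("Would", 6), ("house.", 4),
               ("not", 31), ("tree.", 2), ("car.", 1), ("anywhere.", 5),
               ("in", 16), ("You", 4)]

for word, total in merge_and_sort(slide_pairs):
    print("%s\t%s" % (word, total))
```

The same normalize step could instead be applied inside the mapper, so that the merged counts come straight out of the MapReduce job.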