MapReduce Practice: WordCount

박영택, School of Computer Science

Mapper Execution

Input text:

    I am a boy
    You are a girl

mapper.py:

    #!/usr/bin/env python
    import sys

    # --- read every line from stdin ---
    for line in sys.stdin:
        # --- remove leading and trailing whitespace ---
        line = line.strip()
        # --- split the line into words ---
        words = line.split()
        # --- output (word, 1) pairs in tab-delimited format ---
        for word in words:
            # the parenthesized form runs under both Python 2 and Python 3
            print('%s\t%s' % (word, "1"))

Mapper output (one pair per word occurrence, so "a" appears twice):

    I \t 1
    am \t 1
    a \t 1
    boy \t 1
    You \t 1
    are \t 1
    a \t 1
    girl \t 1
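The mapper logic above can also be written as a small function, which makes it easy to check against the sample input without Hadoop. This is a minimal sketch in Python 3; the function name `map_words` is an assumption for illustration and does not appear in the slides.

```python
def map_words(lines):
    """Yield a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)


# Sample input from the slide, one sentence per line.
sample = ["I am a boy", "You are a girl"]
pairs = list(map_words(sample))

# Print in the same tab-delimited format that mapper.py writes to stdout.
for word, count in pairs:
    print("%s\t%d" % (word, count))
```

The word `a` is emitted twice, once per occurrence. The mapper does no counting of its own; summing the pairs is left to the reducer.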

Reducer Execution

Mapper output (input to the reducer):

    I \t 1
    am \t 1
    a \t 1
    boy \t 1
    You \t 1
    are \t 1
    a \t 1
    girl \t 1

reducer.py:

    #!/usr/bin/env python
    import sys

    # running total for each word seen so far
    word2count = {}

    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        try:
            # split into word and count at the first tab
            word, count = line.split('\t', 1)
            count = int(count)
        except ValueError:
            # skip lines that are not in word<TAB>count form
            continue
        # add to the existing total, starting from 0 for a new word
        word2count[word] = word2count.get(word, 0) + count

    # write one word<TAB>total line per distinct word
    for word in word2count:
        print('%s\t%s' % (word, word2count[word]))

Contents of word2count after all input has been read:

    { I : 1, am : 1, a : 2, boy : 1, You : 1, are : 1, girl : 1 }
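The reducer above keeps every distinct word in a dictionary. Hadoop Streaming sorts mapper output by key before it reaches the reducer, so an alternative is to sum each run of identical keys as it streams past. The following is a sketch of that variant, not the version from the slides. `parse_pairs` and `reduce_sorted` are hypothetical names, and the code assumes its input is already sorted by word.

```python
from itertools import groupby
from operator import itemgetter


def parse_pairs(lines):
    """Parse 'word<TAB>count' lines, skipping any that are malformed."""
    for line in lines:
        word, sep, count = line.strip().partition("\t")
        if not sep:
            continue  # no tab: not a key/value line
        try:
            value = int(count)
        except ValueError:
            continue  # count is not an integer
        yield word, value


def reduce_sorted(lines):
    """Sum counts per word; assumes lines are sorted by word."""
    for word, group in groupby(parse_pairs(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


mapper_output = ["I\t1", "am\t1", "a\t1", "boy\t1",
                 "You\t1", "are\t1", "a\t1", "girl\t1"]

# sorted() stands in for the framework's shuffle/sort phase.
totals = list(reduce_sorted(sorted(mapper_output)))
for word, total in totals:
    print("%s\t%d" % (word, total))
```

Only one word's running total is held in memory at a time, which matters once the vocabulary is large. The trade-off is that the result is wrong if the input is not sorted, since a word split across two runs would be reported twice.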

Demo: Shakespeare

Sample of the input text:

    COUNTESS OF ROUSILLON   mother to Bertram. (COUNTESS:)
    HELENA   a gentlewoman protected by the Countess.
    An old Widow of Florence. (Widow:)
    DIANA   daughter to the Widow.
    VIOLENTA |
             | neighbours and friends to the Widow.
    MARIANA  |

Shakespeare data set:

    comedies    1.7 MB
    glossary    57 KB
    histories   1.4 MB
    poems       262 KB
    tragedies   1.7 MB
    Total       around 5 MB
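Before running a job over the full data set, the map, sort, and reduce stages can be rehearsed locally on a few lines. The sketch below chains the three stages in memory, which is roughly what piping a file through mapper.py, a sort, and reducer.py does on one machine. `run_local` is a hypothetical helper, and the sample lines are taken from the excerpt above.

```python
from collections import defaultdict


def run_local(lines):
    """Map, sort, and reduce in memory; returns {word: count}."""
    # Map: one (word, 1) pair per whitespace-separated token.
    pairs = [(word, 1) for line in lines for word in line.split()]
    # Shuffle/sort: order by key, as the framework does between the stages.
    pairs.sort(key=lambda pair: pair[0])
    # Reduce: add up the counts for each word.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)


excerpt = [
    "HELENA a gentlewoman protected by the Countess.",
    "An old Widow of Florence. (Widow:)",
    "DIANA daughter to the Widow.",
]
counts = run_local(excerpt)
for word in sorted(counts):
    print("%s\t%d" % (word, counts[word]))
```

Because the mapper splits only on whitespace, `Widow`, `Widow.`, and `(Widow:)` are counted as three different words. Normalizing case and stripping punctuation would be a separate step added to the mapper.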