Map Reduce Workshop Monday November 12th, 2012


Pre-Workshop Deck

Overview: Schedule
5:30-6:00 Networking, Software Install, Cloud Setup
6:00-6:10 M/R and Workshop Overview - John Verostek
6:10-7:20 Map/Reduce Tutorial - Vipin Sachdeva (IBM Research Labs)
    The Map/Reduce Programming Framework will be introduced using a hands-on Word Count example using Python. Next the basics of Hadoop Map/Reduce and File Server will be covered. Time permitting, a demo will be given of running the Python M/R program using Hadoop installed locally.
7:20-7:30 Short Break
7:30-8:45 Applications using Amazon Elastic M/R - J Singh (DataThinks.org)
    A Facebook application will also be walked through. For this dataset, everyone who attends the workshop will have the option to sign into a workshop prep page with their Facebook account and give permission to share their likes. The data is automatically anonymized and sent to an Amazon S3 file. The exercise will find likes common to people in the sample. What might someone do after the analysis of such data? Design an advertising campaign, perhaps (but designing an ad campaign is not part of the workshop).

Pre-Workshop Checklist
Cloud: Amazon Web Services will be used.
Facebook Likes Exercise: an app is used to anonymously collect “Likes”; this dataset will be used for the M/R exercise.
Software Installation: Python will be used to run programs locally. Download version 2.7.3 and set the environment variables.
Code and Datasets: links to the files hosted on Amazon are included.
Running Python Locally: various screenshots for trying out a regular (not Map/Reduce) wordcount program.

Cloud Account
Please sign up for an account here: http://aws.amazon.com/
Amazon’s Elastic Map/Reduce will be used. The following 6-minute AWS video shows a wordcount example that is somewhat similar to what will be used in the workshop. The video has enough information to get an M/R job running within 15 minutes.
http://www.youtube.com/watch?v=kNsS9aDf6uE

We will be using Elastic MapReduce and S3

Python Scripts and Wikipedia Datasets

What                             URL
Word Counter (Non-Map/Reduce)    https://s3.amazonaws.com/python-code/seq.py
Word Counter Mapper              https://s3.amazonaws.com/python-code/mapper.py
Sorter for Windows machines      https://s3.amazonaws.com/python-code/sorter.py
Word Counter Reducer             https://s3.amazonaws.com/python-code/reducer.py
                                 https://s3.amazonaws.com/python-code/reducer-all.py

sorter.py is for folks with Windows machines. After loading a link in the browser and seeing the text appear, right-click and choose “Save as”.

Dataset       Size     URL
Very Small    2 KB     https://s3.amazonaws.com/workshop-verysmall/input-very-small.txt
Small         10 MB    https://s3.amazonaws.com/workshop-small/input-small.txt
Medium        76 MB    https://s3.amazonaws.com/workshop-medium/input-medium.txt
Large         1 GB     https://s3.amazonaws.com/workshop-large/input-large.txt
Very Large    8 GB     https://s3.amazonaws.com/workshop-verylarge/input-very-large.txt

These five files of different sizes were created by Vipin to test how long each one takes to run locally. Please note that the 8 GB file may take a while to download.
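The deck does not show what is inside mapper.py, sorter.py, or reducer.py, so the following is only a rough sketch of how a word-count mapper, sorter, and reducer usually fit together. The function names (map_words, sort_pairs, reduce_counts) and the sample sentences are hypothetical and are not taken from the workshop files.

```python
import re
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Mapper: emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in re.split(r"\W+", line):
            if word:  # skip empty strings produced at line boundaries
                yield (word, 1)


def sort_pairs(pairs):
    """Sorter: order the pairs by word so equal words sit next to each other."""
    return sorted(pairs, key=itemgetter(0))


def reduce_counts(sorted_pairs):
    """Reducer: sum the counts of each run of identical words."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


if __name__ == "__main__":
    # Hypothetical sample input, standing in for one of the datasets.
    sample = ["the quick brown fox", "the lazy dog and the fox"]
    for word, total in reduce_counts(sort_pairs(map_words(sample))):
        print("%s\t%d" % (word, total))
```

The sort step matters because this reducer only sums adjacent pairs with the same word. That is presumably why a separate sorter is provided for Windows machines, where the Unix sort command is not available by default.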

Facebook Likes Exercise
Please sign in to the workshop prep page with your Facebook account and give permission to share your likes. If you don't have a Facebook account, no worries. However, if everyone opts out, we won't have much data to work with. All collected data will be anonymized and then deleted after the workshop is done. The exercise will find likes common to people in the sample. What might someone do after the analysis of such data? Design an advertising campaign, perhaps (though designing an ad campaign is not part of the workshop).
The app that collects Facebook data is: http://apps.facebook.com/map-reduce-workshop
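To give a feel for what "likes common to people in the sample" could mean in code, here is a small in-memory sketch. The record format (an anonymized person id paired with a like), the function name common_likes, and the sample data are all assumptions made for illustration; the deck does not describe the real layout of the collected data.

```python
from collections import defaultdict


def common_likes(records, min_people=2):
    """Return {like: number_of_people} for likes shared by at least min_people."""
    people_by_like = defaultdict(set)
    for person_id, like in records:   # "map" step: key each record by its like
        people_by_like[like].add(person_id)
    return dict(                      # "reduce" step: count distinct people per like
        (like, len(people))
        for like, people in people_by_like.items()
        if len(people) >= min_people
    )


if __name__ == "__main__":
    # Hypothetical anonymized records: (person id, like).
    sample = [
        ("u1", "Python"), ("u2", "Python"), ("u3", "Python"),
        ("u1", "Hadoop"), ("u2", "Hadoop"), ("u3", "Coffee"),
    ]
    for like, n in sorted(common_likes(sample).items()):
        print("%s is liked by %d people" % (like, n))
```

Using a set per like means a person who appears twice with the same like is only counted once.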

You should be able to just copy and paste the URL into the browser where you are signed in to Facebook. Something like this should then appear after the app has pulled over your Likes.

Python Download
Mac/Linux: Python comes preinstalled, so you should be able to run it as is.
Windows: if you do not already have Python installed, download and install it from: http://www.python.org/download/
DOWNLOAD VERSION 2.7.3. DO NOT DOWNLOAD VERSION 3.3.0.

Python Installation

Python Environment Variables
Many online instructions explain how to set these, such as the link below:
http://pythoncentral.org/how-to-install-python-2-7-on-windows-7-python-is-not-recognized-as-an-internal-or-external-command/
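One way to check whether the environment variable is set correctly is to ask Python itself. The script below is only a sketch (the helper name dir_on_path is made up for this example). It prints the interpreter's version and directory and reports whether that directory is listed in PATH. If the python command is not recognized yet, it can be started with the full path to the interpreter, which is C:\Python27\python.exe when the Windows installer's default location was kept.

```python
import os
import sys


def dir_on_path(directory, path_value):
    """Return True if directory appears as an entry in a PATH-style string."""
    wanted = os.path.normcase(os.path.normpath(directory))
    entries = [e for e in path_value.split(os.pathsep) if e]
    return any(os.path.normcase(os.path.normpath(e)) == wanted for e in entries)


if __name__ == "__main__":
    python_dir = os.path.dirname(sys.executable)
    print("Python version: %s" % sys.version.split()[0])
    print("Interpreter directory: %s" % python_dir)
    print("On PATH: %s" % dir_on_path(python_dir, os.environ.get("PATH", "")))
```

If the last line prints False, the interpreter directory still needs to be added to PATH as described in the linked instructions.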

More Python Help http://docs.python.org/2/using/

We will be using the Command Line
Get to the directory where your code/data is located:
cd <dir> to change into a directory, e.g. cd users\john\desktop\python
cd .. to go up a level
Ctrl-C to kill a running program

Wordcount - Very Simple Python Script - seq.py

    #!/usr/bin/env python
    import sys   # system functions (stdin)
    import re    # regular expressions library

    counts = {}
    for line in sys.stdin:               # read from stdin, one line at a time
        words = re.split(r"\W+", line)   # split on runs of non-word characters (\W+)
        for word in words:
            counts[word] = counts.get(word, 0) + 1
    for x, y in counts.items():          # print each word with its count
        print x, y

Let’s try running it!

    > python ./seq.py < input.txt > output.txt

seq.py is the Python code, input.txt is the Wikipedia dataset filename, and output.txt is the output file (it can have any name).
Make sure the < and > point in the right directions, or you could overwrite the input file.
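As a variation on seq.py, the same counting logic can be wrapped in a function so it is easier to try on a few lines before running a full dataset. This is a sketch rather than the workshop's script: it uses collections.Counter, drops the empty strings that re.split leaves at line boundaries, and uses a print form that runs on Python 2.7 as well as Python 3. The name count_words and the sample lines are made up for illustration.

```python
import re
from collections import Counter


def count_words(lines):
    """Count words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # \W+ matches runs of non-word characters; skip empty pieces.
        counts.update(word for word in re.split(r"\W+", line) if word)
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines standing in for a dataset.
    sample = ["the cat sat", "the cat ran"]
    for word, n in sorted(count_words(sample).items()):
        print("%s %d" % (word, n))
```

To process a real dataset, pass sys.stdin (after import sys) instead of the sample list and run the script with the same < input.txt > output.txt redirection.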

    > python ./seq.py < input.txt

No output file is given, so the output goes to the screen. Press CTRL-C to break.

Time
Mac OS/Linux/Cygwin: > time python ./seq.py < input.txt
Windows
If you run the datasets locally and “time” how long they take (output to a file, not the screen), then please email me the results (including laptop info, etc.), as we are making a composite slide of “time vs. file size.”

Category      Size     Time
Very small    2 KB
Small         10 MB
Medium        80 MB
Large         1 GB
Very large    8 GB     My Windows laptop took 4 hours*

* I noted the start time, 11:46 PM, ran it overnight, and went with the timestamp on the output file, 03:54 AM.
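The Windows command prompt has no equivalent of the Unix time command (its own time command shows or sets the clock), so one option is to measure the run from inside Python. The sketch below is not part of the workshop code. The function names are hypothetical, and it assumes the dataset is UTF-8 text, replacing any bytes that fail to decode.

```python
import io
import re
import time


def count_words(lines):
    """Count words across an iterable of text lines."""
    counts = {}
    for line in lines:
        for word in re.split(r"\W+", line):
            if word:
                counts[word] = counts.get(word, 0) + 1
    return counts


def timed_wordcount(input_path, output_path):
    """Write word counts for input_path to output_path; return elapsed seconds."""
    start = time.time()
    # Assumption: the input is UTF-8 text; undecodable bytes are replaced.
    with io.open(input_path, encoding="utf-8", errors="replace") as src:
        counts = count_words(src)
    with io.open(output_path, "w", encoding="utf-8") as dst:
        for word, n in counts.items():
            dst.write(u"%s %d\n" % (word, n))
    return time.time() - start
```

For example, print(timed_wordcount("input-small.txt", "output.txt")) would report the wall-clock seconds for the small dataset, with the output going to a file as requested above.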