Actors and Actresses. Danger: Please be careful! IMDb (I assume you all know it?)

Presentation transcript:

Actors and Actresses

Danger: Please be careful!

IMDb (I assume you all know it?)

IMDb Dump: Not open/free!

The Question You Are Going to Answer
Which pair of actors/actresses has acted together the most times?

An Example
In how many movies have Al Pacino and Robert De Niro starred together, according to IMDb?

IMDb: Typical File
Log into machine: cluster.dcc.uchile.cl
Username: uhadoop
zcat /data/hadoop/hadoop/data/imdb/actors.list.gz | more

IMDb: Already Parsed
zcat /data/hadoop/hadoop/data/imdb/tsv/actpersons-to-movies.tsv.gz | more
How many theatrical movies was Uma Thurman in?
zcat /data/hadoop/hadoop/data/imdb/tsv/actresses-to-movies.tsv.gz | grep -e "^Thurman, Uma" | grep -e "THEATRICAL_MOVIE" | wc -l
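For comparison, the same filter-and-count idea can be sketched in plain Java. This is only an illustration: the class name `CountMoviesSketch`, the sample lines, and their tab-separated layout are made up here and do not reflect the real schema of the TSV file.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not part of the lab project.
class CountMoviesSketch {

    // Counts lines that start with namePrefix and contain typeMarker,
    // mirroring: grep -e "^name" | grep -e "marker" | wc -l
    static long countMatching(List<String> lines, String namePrefix, String typeMarker) {
        return lines.stream()
                .filter(line -> line.startsWith(namePrefix))
                .filter(line -> line.contains(typeMarker))
                .count();
    }

    public static void main(String[] args) {
        // Made-up sample lines; the column layout is an assumption.
        List<String> lines = Arrays.asList(
                "Thurman, Uma\tMovie A\tTHEATRICAL_MOVIE",
                "Thurman, Uma\tMovie B\tOTHER_TYPE",
                "Someone, Else\tMovie A\tTHEATRICAL_MOVIE");
        System.out.println(countMatching(lines, "Thurman, Uma", "THEATRICAL_MOVIE"));
    }
}
```

On the cluster, the zcat/grep pipeline remains the quickest way to inspect the data; the sketch just makes the two filters explicit.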

The Question You Are Going to Answer
Which pair of actors/actresses has acted together the most times?

1. Download the project

2. Implement the Hadoop job(s)!
  Adapt the WordCount example (refer to the lab slides from last week)
  You can use one class file for each part of the task
  Test on the small file: /uhadoop/imdb/actpersons-to-movies.100k.tsv
  Run on the big file: /uhadoop/imdb/full/actpersons-to-movies.tsv
  Write to your directory!!! /uhadoop/[username]
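One plausible first job groups actors by movie and emits every co-star pair once. The sketch below shows that logic in plain Java, without the Hadoop API, so it can be run and checked locally. The class name `EmitPairsSketch`, the two-field `{actor, movie}` record layout, and the sample names are assumptions for illustration; the real TSV has its own columns, and in Hadoop the grouping would be done by the shuffle between map and reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java sketch of an "emit pairs" step; no Hadoop dependency.
class EmitPairsSketch {

    // Map-like step: groups actor names by movie.
    // Each record is {actor, movie}; this layout is an assumption.
    static Map<String, TreeSet<String>> castByMovie(List<String[]> records) {
        Map<String, TreeSet<String>> cast = new TreeMap<>();
        for (String[] record : records) {
            cast.computeIfAbsent(record[1], k -> new TreeSet<>()).add(record[0]);
        }
        return cast;
    }

    // Reduce-like step: emits each unordered pair once, smaller name first,
    // so that (A, B) and (B, A) map to the same key. Tab-separated because
    // actor names contain commas.
    static List<String> pairsForCast(TreeSet<String> actors) {
        List<String> sorted = new ArrayList<>(actors);
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            for (int j = i + 1; j < sorted.size(); j++) {
                pairs.add(sorted.get(i) + "\t" + sorted.get(j));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Made-up records for illustration.
        List<String[]> records = Arrays.asList(
                new String[] {"Actor B", "Movie 1"},
                new String[] {"Actor A", "Movie 1"},
                new String[] {"Actor C", "Movie 1"},
                new String[] {"Actor A", "Movie 2"},
                new String[] {"Actor B", "Movie 2"});
        for (Map.Entry<String, TreeSet<String>> e : castByMovie(records).entrySet()) {
            for (String pair : pairsForCast(e.getValue())) {
                System.out.println(pair + "\t1");
            }
        }
    }
}
```

Ordering the two names inside each pair is the design choice that matters here: without it, the same two people would be counted under two different keys.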

3. Continuation
  Count the pairs: CountPairs.java
  Sort the pairs: SortPairs.java
  Figure out the input
  Figure out the map/reduce phase
  Adapt a previous example (WordCount or EmitPairs)
    Change the generics
    Implement the new Map/Reduce
  Run it!
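The counting and sorting steps can be sketched locally in plain Java as follows. This is not the contents of CountPairs.java or SortPairs.java; the class name `CountAndSortPairsSketch` and the sample pair keys are hypothetical. In a Hadoop job, one common approach to the sort is to emit the count as the key so that the shuffle orders the output, but that is an assumption about how you might implement it, not a requirement stated in the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of "count the pairs" and "sort the pairs".
class CountAndSortPairsSketch {

    // Count step: sums the occurrences of each pair key (like WordCount).
    static Map<String, Integer> countPairs(List<String> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String pair : pairs) {
            counts.merge(pair, 1, Integer::sum);
        }
        return counts;
    }

    // Sort step: highest count first; ties broken by pair key.
    static List<Map.Entry<String, Integer>> sortByCountDesc(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        // Made-up pair keys, one per movie the two actors share.
        List<String> pairs = Arrays.asList(
                "Actor A\tActor B", "Actor A\tActor C", "Actor A\tActor B");
        for (Map.Entry<String, Integer> e : sortByCountDesc(countPairs(pairs))) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The first entry of the sorted list is the answer to the question; on the full dataset you would read it from the top of the sorted job output instead.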