Project 1: Who is Popular, and Who is Not

Project 1: Who is Popular, and Who is Not
Angel Trifonov, Anh Pham, Xiao Qin

Tasks
Tasks b and c: both in Pig and in Java.
Task h: in Java.

Task b in Java
Write a job (or jobs) that reports, for each country, how many of its citizens have a Facebook page.
Single MapReduce job with a single reducer.
Input: MyPage dataset.
Mapper: examine each file line by line. Each line is converted to a string and split on the "," delimiter. The nationality is extracted and emitted as the key, with an IntWritable as the value.
Reducer: take all pairs and sum the values for each key.
Output: number of users per nationality.
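
The mapper and reducer logic described on this slide can be sketched in plain Java, without any Hadoop classes. This is only an illustration under stated assumptions: the class name `TaskBSketch`, the column index `NATIONALITY_COLUMN`, and the emitted count of 1 are hypothetical and are not taken from the project's code. The `run` method stands in for the shuffle that Hadoop would perform between the two phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of Task b (no Hadoop dependency).
class TaskBSketch {
    // Assumption: position of the nationality field in a MyPage line.
    // The real column index depends on the dataset layout.
    static final int NATIONALITY_COLUMN = 2;

    // Map step: split one line on "," and emit (nationality, 1).
    static Map.Entry<String, Integer> map(String line) {
        String[] fields = line.split(",");
        return new AbstractMap.SimpleEntry<>(fields[NATIONALITY_COLUMN].trim(), 1);
    }

    // Reduce step: sum all values that share one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Stand-in for the shuffle: group mapper output by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> pair = map(line);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }
}
```

Calling `TaskBSketch.run(lines)` on a list of comma-separated MyPage lines returns one count per nationality. In a real job the same two steps would live in `Mapper` and `Reducer` subclasses, with `Text` keys and `IntWritable` values.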

Task b in Pig
Group the MyPage dataset by country code:
    countrygrp = group mypage by cc;
Report the number of people who have a Facebook page for each country:
    taskb = foreach countrygrp generate group, COUNT(mypage.id);
    dump taskb;
Running time comparison (job time): plain MapReduce 1 min 36 sec; Pig 24 sec.

Task c in Java
Find the top 10 interesting Facebook pages, namely those that received the most accesses in the AccessLog dataset compared to all other pages.
Hadoop settings: multiple mappers and one reducer (setNumReduceTasks(1)).
Input: AccessLog.
1st round, mapper(s): parse the input data, extract WhatPage, and emit WhatPage as the key with the constant 1 as the value.
1st round, reducer: for each key, sum the values; emit WhatPage as the key and the total count as the value.
2nd round: swap the key and the value (InverseMapper.class).
Output: [Count], [WhatPage], in descending order of count.
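
The two rounds above can be sketched in plain Java as a count followed by a sort on the swapped (count, page) pairs. This is a rough illustration, not the project's Hadoop code: the class name `TaskCSketch`, the column index `WHAT_PAGE_COLUMN`, and the tie-break by page name are assumptions introduced here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of Task c (no Hadoop dependency).
class TaskCSketch {
    // Assumption: position of the WhatPage field in an AccessLog line.
    static final int WHAT_PAGE_COLUMN = 2;

    // Round 1: emit (WhatPage, 1) per line and sum the values per key.
    static Map<String, Integer> countAccesses(List<String> accessLogLines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : accessLogLines) {
            String whatPage = line.split(",")[WHAT_PAGE_COLUMN].trim();
            counts.merge(whatPage, 1, Integer::sum);
        }
        return counts;
    }

    // Round 2: order by count, descending, and keep the first n entries.
    // Each output line is "count,page", i.e. key and value swapped.
    static List<String> topPages(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            // Assumption: ties are broken by page name to keep the output deterministic.
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(n, entries.size()))) {
            out.add(e.getValue() + "," + e.getKey());
        }
        return out;
    }
}
```

`TaskCSketch.topPages(TaskCSketch.countAccesses(lines), 10)` yields the ten most accessed pages. In the Hadoop version the ordering comes from the sort on the swapped key during the shuffle; with a single reducer, that one sorted stream is the global ranking.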

Task c in Pig
Group the AccessLog dataset by the accessed Facebook ID:
    access_fid_grp = group alog by fid;
Get the access count for each accessed Facebook ID:
    grpcnt = foreach access_fid_grp generate group, COUNT(alog.aid) as alogcnt;
Order by count, descending:
    grporder = order grpcnt by alogcnt desc;
List the top 10:
    taskc = limit grporder 10;
    dump taskc;
Running time comparison (job time): plain MapReduce 2 min 1 sec; Pig 1 min 52 sec.

Task h: Defining Potential Stalkers
A potential stalker is a person who visits another person's Facebook page too often although the two are not friends.

Mapper
Inputs: the Friends and AccessLog datasets, with records laid out as "1st field, PersonID, 3rd field, …".
Output key: the 2nd field (person ID), as an IntWritable.
Output value: "<dataset tag>,<ID>", as Text.
Friends record: key personID, value "f,friendID".
AccessLog record: key personID, value "a,visitedID".

Reducer
Key: <personID>
Value list: <(f,friendID) (a,visitedID) (f,friendID) (a,visitedID) …>
Sort the list by the second field of each element, so that all visitedID and friendID entries with the same value are placed next to each other.
If every entry for a given ID is a visitedID (no friendID entry exists) and the ID appears too many times, based on a predefined threshold, the person is a potential stalker of that ID.
Output: personID visitedID
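
The reducer logic above can be sketched in plain Java as a sort followed by a single scan over runs of equal IDs. This is a minimal sketch under stated assumptions, not the project's implementation: the names `TaskHSketch` and `Tagged` are hypothetical, IDs are assumed to be integers, and "too many times" is read as strictly more than `threshold` visits.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the Task h reducer (no Hadoop dependency).
class TaskHSketch {
    // One mapper output value: a dataset tag ("f" or "a") and an ID.
    static final class Tagged {
        final String tag;
        final int id;

        Tagged(String value) {
            String[] parts = value.split(",");
            this.tag = parts[0].trim();
            this.id = Integer.parseInt(parts[1].trim()); // assumption: IDs are integers
        }
    }

    // Reduce step for one personID. Each value is "f,<friendID>" or "a,<visitedID>".
    static List<String> reduce(int personId, List<String> values, int threshold) {
        List<Tagged> sorted = new ArrayList<>();
        for (String v : values) {
            sorted.add(new Tagged(v));
        }
        // Sort by the second field so that equal IDs end up next to each other.
        sorted.sort((x, y) -> Integer.compare(x.id, y.id));

        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < sorted.size()) {
            int id = sorted.get(i).id;
            int visits = 0;
            boolean isFriend = false;
            // Scan the run of entries that share this ID.
            while (i < sorted.size() && sorted.get(i).id == id) {
                if (sorted.get(i).tag.equals("f")) {
                    isFriend = true;
                } else {
                    visits++;
                }
                i++;
            }
            // Assumption: "too many" means strictly more visits than the threshold.
            if (!isFriend && visits > threshold) {
                out.add(personId + "\t" + id); // tab-separated "personID visitedID"
            }
        }
        return out;
    }
}
```

For example, `TaskHSketch.reduce(5, values, 2)` reports every ID that person 5 visited more than twice without being friends. Sorting first lets the scan run in one pass with constant extra state; a hash map keyed by ID would avoid the sort at the cost of more memory per reducer key.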

Sample Result

Thank you! Questions?