1
Project 1: Who is Popular, and Who is Not.
Angel Trifonov Anh Pham Xiao Qin
2
Tasks
- Tasks b and c, both in Pig and Java
- Task h in Java
3
Task b in Java
Write a job(s) that reports, for each country, how many of its citizens have a Facebook page.
- Single map-reduce job with a single reducer
- Input: MyPage dataset
- Mapper: examine each file line by line. Each line is converted to a string and split using the "," delimiter. Extract the nationality and map it to an IntWritable.
- Reducer: take all pairs and sum the values for each key
- Output: number of users per nationality
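The mapper and reducer steps above can be sketched in plain Java, without the Hadoop API, so the logic runs standalone. This is only an illustration: the class name, the sample records, and the position of the nationality field (`NATIONALITY_FIELD`) are assumptions, since the slides do not give the MyPage record layout.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the Task b job; no Hadoop classes are used.
final class CountryCountSketch {
    // Hypothetical position of the nationality field in a MyPage record.
    static final int NATIONALITY_FIELD = 2;

    // "Mapper" step: split one line on "," and return the nationality key.
    static String mapLine(String line) {
        String[] fields = line.split(",");
        return fields[NATIONALITY_FIELD].trim();
    }

    // "Reducer" step: add 1 per record to the total of its nationality key.
    static Map<String, Integer> countByCountry(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(mapLine(line), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample records, for illustration only.
        List<String> sample = List.of(
            "1,Alice,USA", "2,Binh,Vietnam", "3,Carol,USA");
        System.out.println(countByCountry(sample)); // prints {USA=2, Vietnam=1}
    }
}
```

In an actual Hadoop job the key would be a Text and the value an IntWritable, as the slide describes; the map here only plays the role of the shuffle plus the summing reducer.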
4
Task b in Pig
Group the MyPage dataset by country code:
    countrygrp = group mypage by cc;
Report the number of people who have a Facebook page for each country:
    taskb = foreach countrygrp generate group, COUNT(mypage.id);
    dump taskb;
Running time comparison (job time):
- Plain MapReduce: 1 min 36 sec
- Pig: 24 sec
5
Task c in Java
Find the top 10 interesting Facebook pages, namely, those that got the most accesses based on your AccessLog dataset compared to all other pages.
- Hadoop settings: multiple mappers and one reducer (setNumReduceTasks(1))
- Input: AccessLog
- 1st round, mapper(s): parse the input data, get the WhatPage, and set WhatPage as the key and the constant 1 as the value.
- 1st round, reducer: for each key, sum the values. Set WhatPage as the key and the total count as the value.
- 2nd round: swap the key and value (InverseMapper.class).
- Output: [Count], [WhatPage] (in descending order)
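The two rounds above can be approximated in plain Java as a count followed by a swap-and-sort. This is a sketch under assumptions, not the job itself: the class name, the field index `WHAT_PAGE_FIELD`, and the tie-break by page ID are hypothetical, and the top-10 cutoff is applied with a simple sublist.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the two Task c rounds; no Hadoop classes are used.
final class TopPagesSketch {
    // Hypothetical position of WhatPage in an AccessLog record.
    static final int WHAT_PAGE_FIELD = 2;

    // Round 1: emit (WhatPage, 1) per record and sum the values per key.
    static Map<String, Integer> countAccesses(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String page = line.split(",")[WHAT_PAGE_FIELD].trim();
            counts.merge(page, 1, Integer::sum);
        }
        return counts;
    }

    // Round 2: swap to "count,page", order by count descending, keep `limit` rows.
    static List<String> topPages(Map<String, Integer> counts, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            // Tie-break on page ID (an assumption) to keep the order deterministic.
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> out = new ArrayList<>();
        int end = Math.min(limit, entries.size());
        for (Map.Entry<String, Integer> e : entries.subList(0, end)) {
            out.add(e.getValue() + "," + e.getKey());
        }
        return out;
    }
}
```

Calling `TopPagesSketch.topPages(TopPagesSketch.countAccesses(lines), 10)` mirrors the [Count], [WhatPage] output of the second round.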
6
Task c in Pig
Group the AccessLog dataset by accessed Facebook ID:
    access_fid_grp = group alog by fid;
Get the access count for each accessed Facebook ID:
    grpcnt = foreach access_fid_grp generate group, COUNT(alog.aid) as alogcnt;
Order the counts in descending order:
    grporder = order grpcnt by alogcnt desc;
List the top 10:
    taskc = limit grporder 10;
    dump taskc;
Running time comparison (job time):
- Plain MapReduce: 2 min 1 sec
- Pig: 1 min 52 sec
7
Task h: Define Potential Stalkers
A potential stalker is a person who visits another person's Facebook page too often, although the two are not friends.
8
Mapper
Input record: 1st Field, PersonID, 3rd Field …
- Output key: 2nd field (Person ID): IntWritable
- Output value: "<dataset tag>, <ID>": Text
Emitted pairs:
- Friends: personID → f, friendID
- Accesslog: personID → a, visitedID
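The tagging step of this mapper can be sketched in plain Java as follows. The slides do not say how the mapper tells the two datasets apart or which field holds the friendID or visitedID, so this sketch simply takes the tag ("f" or "a") as a parameter and assumes the related ID is the 3rd field; both choices, and the class name, are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Plain-Java stand-in for the Task h mapper; no Hadoop classes are used.
final class TagMapperSketch {
    // Returns (personID, "<dataset tag>,<ID>") for one input record.
    // Key: the 2nd field (personID), as on the slide.
    // Value: the tag plus the 3rd field (assumed to hold friendID or visitedID).
    static Map.Entry<Integer, String> map(String tag, String line) {
        String[] fields = line.split(",");
        int personId = Integer.parseInt(fields[1].trim());
        String relatedId = fields[2].trim();
        return new SimpleEntry<>(personId, tag + "," + relatedId);
    }
}
```

For example, `TagMapperSketch.map("f", "100,5,9")` yields key 5 with value "f,9", and the same line tagged "a" yields "a,9", so both datasets arrive at the reducer under the same personID key.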
9
Reducer
- Key: <personID>
- Value list: <(f,friendID) (a,visitedID) (f,friendID) (a,visitedID) …>
- Sort the list by the second field of each element, so that all visitedID and friendID entries with the same value are placed next to each other.
- If all entries for an ID are visitedID entries and the ID appears too many times (based on a predefined threshold), the person is a potential stalker.
- Output: personID, visitedID
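The reducer logic above can be sketched in plain Java for a single personID key. This is an illustration under assumptions: the class name is hypothetical, values are taken in the "tag,ID" form produced by the mapper, and "too many times" is read as a visit count greater than or equal to the threshold, which the slides do not specify.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java stand-in for the Task h reducer; no Hadoop classes are used.
final class StalkerReducerSketch {
    private static int idOf(String[] pair) {
        return Integer.parseInt(pair[1].trim());
    }

    // Returns the visitedIDs for which this person is a potential stalker.
    static List<Integer> reduce(List<String> values, int threshold) {
        List<String[]> items = new ArrayList<>();
        for (String v : values) {
            items.add(v.split(","));
        }
        // Sort by the second field so equal IDs sit next to each other.
        items.sort(Comparator.comparingInt((String[] p) -> idOf(p)));

        List<Integer> flagged = new ArrayList<>();
        int i = 0;
        while (i < items.size()) {
            int id = idOf(items.get(i));
            int visits = 0;
            boolean isFriend = false;
            // Walk the run of entries that share this ID.
            while (i < items.size() && idOf(items.get(i)) == id) {
                if (items.get(i)[0].trim().equals("f")) {
                    isFriend = true;
                } else {
                    visits++;
                }
                i++;
            }
            // Only visits, no friendship, and at least `threshold` visits (assumed >=).
            if (!isFriend && visits >= threshold) {
                flagged.add(id);
            }
        }
        return flagged;
    }
}
```

Each returned ID would be written out together with the reducer key as a (personID, visitedID) pair.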
10
Sample Result
11
Thank you! Questions?