Crawling and Social Ranking. CSCI 572 Class Project. Huy Pham, USC. April 28th, 2011

Part 1: Crawling and USC Servers' Analysis
Written in Java.
Each crawler has 30 threads that work in parallel (multi-threaded programming).
Shared resources are synchronized.
Java sockets are used to make the crawlers communicate and exchange data.
3 crawlers: USC, Viterbi and LAS (Dornsife).
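A minimal sketch of the threading setup described above, assuming a shared URL frontier guarded by synchronized methods; the class names (Frontier, CrawlerThread) and the empty download step are illustrative, not the actual project code.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Shared, synchronized URL frontier: the resource all 30 threads contend for.
class Frontier {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> seen = new HashSet<>();

    public synchronized void add(String url) {
        if (seen.add(url)) {          // avoid re-queueing URLs already seen
            queue.add(url);
            notifyAll();              // wake up idle crawler threads
        }
    }

    public synchronized String next() throws InterruptedException {
        while (queue.isEmpty()) {
            wait();                   // block until another thread adds a URL
        }
        return queue.poll();
    }
}

// One crawler thread: repeatedly takes a URL, downloads it, enqueues out-links.
class CrawlerThread extends Thread {
    private final Frontier frontier;
    CrawlerThread(Frontier frontier) { this.frontier = frontier; }

    @Override
    public void run() {
        try {
            while (true) {
                String url = frontier.next();
                // download(url), parse out-links, and feed them back
                // via frontier.add(link) would go here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

public class Crawler {
    public static void main(String[] args) {
        Frontier frontier = new Frontier();
        frontier.add("http://usc.edu/");   // seed page
        for (int i = 0; i < 30; i++) {     // 30 threads per crawler
            new CrawlerThread(frontier).start();
        }
    }
}
```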

Warning Message
Message from USC regarding the traffic during the run: "Your dynamic host had significant network HTTP activity between 10:14 and 10:24. It connected to 197 hosts, with port 80/HTTP connections, and 1000 connections a second. We suspect that your host may be compromised, or may be misconfigured. If so, you may have to reinstall your system, install updated service packs, and any relevant security patches, as other backdoors may have been installed by hackers. If your host causes network problems, it will be blocked."

Details of the performance of the three servers (table: T in minutes vs. pages downloaded by LAS, USC and Viterbi). The performance of the LAS server (the time to process a query and return a response to a request) is much worse than that of USC and Viterbi.

Graph of the LAS, USC and Viterbi servers' performance for the table of data above.

Zombie Links: links that are included in pages but don't work (removed or expired).

What does the peak say?

Explanation for the peak in the graph of the previous slide
The seed page (initial page) is usc.edu, and all the links found later are added to a queue and processed first come, first served. The seed and the pages close to it are updated often, and they are processed first by the crawlers, so there are fewer zombies on the left side of the graph. Further from the seed, pages get old or expire because they are not updated, so the number of zombie links increases sharply, which creates the peak. On the descending part of the graph, the number of zombie links grows slowly after the peak because most of them have already been examined by the crawler, while the number of downloaded pages keeps increasing, which produces the descending branch.

First Conclusion
Crawlers are a powerful tool for analyzing servers.
Performance: the time it takes the server to process a query and return a response.
Characteristics: the live and dead links that exist on the server. Part of the server is out of date and many pages were removed or expired, but the links to those pages still remain in other pages; this leads to a high proportion of zombie links.

Firewall Mode
Diagram: the site is partitioned alphabetically among the crawlers (a->d, e->h, k->n, ..., the rest).

Explanation for the previous slide
First, there is 1 crawler for each site: USC and Viterbi. Then two crawlers for each site, then 4, 8, 16. A site is divided into partitions as follows: usc.edu/a->d (anything after the site's name that starts with the letter a, b, c or d), the second crawler handles usc.edu/e->h, and so on. The last crawler also handles any non-alphabetical characters such as ~ and digits. The coverage is the ratio of the number of unique pages downloaded to the total number of pages the crawler is supposed to download.
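A sketch of the alphabetical partition rule just described; the split points here are computed uniformly over the alphabet rather than the exact a->d / e->h ranges from the slide, and the class and method names are illustrative.

```java
class UrlPartitioner {
    // Assign a URL path (the part after the site name) to one of n crawlers by
    // its first character; non-alphabetical characters (~, digits, ...) go to
    // the last crawler, as described on the slide.
    static int assignCrawler(String path, int n) {
        if (path.isEmpty()) return n - 1;
        char c = Character.toLowerCase(path.charAt(0));
        if (c < 'a' || c > 'z') return n - 1;
        return Math.min((c - 'a') * n / 26, n - 1);  // split the alphabet into n roughly equal ranges
    }
}
```

With n = 2, for example, this sends usc.edu/a... through usc.edu/m... to crawler 0 and everything else (n through z plus non-alphabetical paths) to crawler 1.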

Coverage graph (coverage as a function of the number of crawlers per site).

Explanation of the coverage graph
When n = 1 (one crawler for USC and one for Viterbi), the coverage of Viterbi is much higher than that of USC. This means USC has more pages that are not reachable from the seed usc.edu; those pages are only reachable via links from Viterbi and LAS. Viterbi, on the other hand, has high coverage because most of its pages are reachable from inside the Viterbi domain. When n = 2, USC's coverage drops dramatically while Viterbi's does not. This suggests that USC organizes its data into divisions or clusters (academics, admission, finance, ...) and that pages within a division point to each other more than pages in different divisions. Viterbi's pages are less clustered; they are more uniformly linked to each other throughout the domain.

Characteristics
USC: has more clusters that are somewhat independent (different divisions: academics, admission, undergraduates, graduates). The reason for clustering is that USC has more data to handle than Viterbi, so clustering the data is necessary. Firewall mode does not work well for USC.
Viterbi: data are more uniformly distributed over the link structure, therefore firewall mode works well for Viterbi.

Cross-over Mode
Overlap is defined as the ratio of the total number of downloaded pages (including duplicates) to the number of unique pages downloaded.
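For concreteness, a small sketch of how the two metrics used in these experiments could be computed from a crawl log: coverage (defined in the firewall-mode explanation above) and overlap (defined here). The data structures are illustrative.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CrawlMetrics {
    // overlap = total pages downloaded (including duplicates) / unique pages downloaded
    static double overlap(List<String> downloaded) {
        Set<String> unique = new HashSet<>(downloaded);
        return (double) downloaded.size() / unique.size();
    }

    // coverage = unique pages downloaded / total pages the crawler was supposed to download
    static double coverage(List<String> downloaded, int pagesInPartition) {
        Set<String> unique = new HashSet<>(downloaded);
        return (double) unique.size() / pagesInPartition;
    }
}
```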

Observations
When n = 1 (one crawler for USC, one for Viterbi), the overlap for USC is larger, which means more links point into USC from outside (from Viterbi and LAS) than point into Viterbi.
When n > 1, a domain itself is divided into parts and each part has its own crawler. Since Viterbi has a more uniform linking structure, the different parts still point to each other, giving a large overlap. The parts of USC point to each other less, so the overlap is smaller.

Exchange Mode
Communication overhead: the number of links exchanged (received) per downloaded page, plotted against the number of crawlers n.
Sample data of the overlap for n = 1 and n = 2 (table columns: n, USC, Viterbi).

Replicate the most popular links
Reduce the communication overhead by not exchanging the most popular pages, because those pages have a very high probability of being downloaded by their own crawlers anyway.
The most popular pages are those that have the most in-links. How do we find these most popular pages? Search engines use the link structure to rank their results: pages with more in-links get higher ranking scores. So we use Google Search to find the most popular pages for USC and Viterbi. Examples of the most popular pages for USC were shown on the slide. When the Viterbi crawler sees one of those links, it does not send it to the USC crawler, which reduces the traffic.
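A sketch of the filtering step described above: before an inter-site link is exchanged, it is checked against a locally replicated set of the peer site's most popular pages. The class and method names are illustrative.

```java
import java.util.Set;

class LinkExchanger {
    private final Set<String> replicatedPopular;  // peer site's most popular pages, known locally
    private final String peerDomain;              // e.g. "usc.edu" for the Viterbi crawler

    LinkExchanger(Set<String> replicatedPopular, String peerDomain) {
        this.replicatedPopular = replicatedPopular;
        this.peerDomain = peerDomain;
    }

    // Called for every out-link found while parsing a downloaded page.
    void handleLink(String url) {
        if (!url.contains(peerDomain)) {
            return;                   // link stays within our own partition
        }
        if (replicatedPopular.contains(url)) {
            return;                   // popular page: the peer crawler will find it anyway, skip the message
        }
        sendToPeer(url);              // otherwise exchange it over the crawler-to-crawler socket
    }

    private void sendToPeer(String url) {
        // socket write to the peer crawler would go here
    }
}
```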

New results in reducing communication overhead
For the case where there is one crawler for each site (USC or Viterbi), the green cells show the results after reducing communication overhead by replicating (not exchanging) the most popular links (pages) of each site (table columns: n, USC, Viterbi, USC, Viterbi). The LAS site is not examined because of its slow responses: it takes too long to crawl LAS. Its server's performance is compared graphically to those of Viterbi and USC on the 5th slide.

Conclusions for Crawling
Crawling is an excellent tool for analyzing a server's performance and the characteristics of web sites.
For sites with uniform (non-clustered) linking structures, firewall mode outperforms cross-over mode and saves network traffic compared to exchange mode. We can therefore dynamically determine which type of crawling (firewall or exchange) to apply to each site in the future by first crawling the site to determine whether its linking structure is uniform.
For sites with clusters, such as USC, one must use exchange mode (and almost always avoid cross-over mode: the overlap is too heavy compared to the information-exchange traffic).

Part 2: Social Ranking
We introduce a ranking algorithm for a new type of data.
If we are able to crawl Facebook (user IDs and the locations they have checked in to), then this algorithm determines how strongly two people are connected based on the locations they went to at the same time, and can therefore suggest friendships on Facebook that take the geospatial network into account.
The same applies to Amazon: if two users A and B buy many of the same products, each product multiple times, then we can suggest (advertise) to B products that A has bought but B has not.

Example for Amazon
There are three products: grey, blue and white. We have the purchase history of the users: A and B both bought grey 4 times, blue 4 times and white 6 times. Similarly for C and D, while E and F bought grey 16 times and neither blue nor white.
The question is: do A and B have more or less common interest than C and D, or than E and F? And how do we rank the similarity of interests among these pairs of people? (Figure: purchase counts for the pairs A and B, C and D, E and F.)

Example for Facebook
The same situation: imagine each box is a place, and the number in it is the number of times the two users happened to be in that place at the same time. How closely are two people connected, based on the number of different places they went to at the same time and the number of times they went to each place at the same time?

For Facebook, divide the area of interest into cells
Different scales of cells: a cell can have the size of a campus (USC, Hollywood, the Grove center, ...) or the size of a building, center, or club.
A(2,4,1,0,0,0,2)  B(0,4,4,0,2,0,2)  C(0,0,0,0,2,0,0)
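The slides do not specify how a check-in location is mapped to a cell; below is a minimal sketch assuming a uniform grid, where the cell size parameter sets the scale (campus-sized vs. building-sized cells). The class and method names are illustrative.

```java
class CellGrid {
    // Map a (latitude, longitude) check-in to a grid cell id, assuming a uniform
    // grid over the area of interest; cellSizeDeg sets the scale (larger values
    // give campus-sized cells, smaller ones building-sized cells).
    static long cellId(double lat, double lon, double cellSizeDeg) {
        long row = (long) Math.floor(lat / cellSizeDeg);
        long col = (long) Math.floor(lon / cellSizeDeg);
        return row * 1_000_000L + col;   // pack row and column into one id (illustrative)
    }
}
```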

M people, N cells
Create co-occurrence vectors: for each pair of people, retain only the number of co-occurrences in each cell.
A(5,5,10,0,4,0,2)  B(2,4,4,0,2,0,2)  C(0,0,1,0,2,0,0)
AB = (1,4,3,0,0,0,2)  BC = (0,0,1,0,1,0,0)  AC = (0,0,0,0,2,0,0)
f(AB) > f(BC) > f(AC)
Total number of pairs (co-occurrence vectors): M(M-1)/2.
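The co-occurrence vectors above are given as example data; the sketch below shows one way they could be built from timestamped check-ins, assuming a co-occurrence means both users were in the same cell during the same time slot (the exact time-matching rule is not stated on the slides). The Checkin type and the method names are illustrative.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Checkin {
    final String user; final long cell; final long timeSlot;
    Checkin(String user, long cell, long timeSlot) {
        this.user = user; this.cell = cell; this.timeSlot = timeSlot;
    }
}

class CoOccurrence {
    // For one pair of users, count co-occurrences per cell: both users present
    // in the same cell during the same time slot (a simple quadratic scan).
    static Map<Long, Integer> vectorFor(String a, String b, List<Checkin> checkins) {
        Map<Long, Integer> counts = new HashMap<>();
        for (Checkin ca : checkins) {
            if (!ca.user.equals(a)) continue;
            for (Checkin cb : checkins) {
                if (!cb.user.equals(b)) continue;
                if (ca.cell == cb.cell && ca.timeSlot == cb.timeSlot) {
                    counts.merge(ca.cell, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```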

Similarity Based on Euclidean Distance
Pure Euclidean Distance (PED).
Projected Pure Euclidean Distance (PPED), optional: for memory saving and computational efficiency.
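The PED and PPED formulas themselves are on the original slide images and are not reproduced in this transcript. A hedged reconstruction, taking PED as the ordinary Euclidean distance between a co-occurrence vector C_i and the master vector V (introduced on the next slide), and PPED as the same distance restricted to the P retained coordinates (see the memory-saving slide later):

```latex
\mathrm{PED}(C_i, V)  = \sqrt{\sum_{j=1}^{N} \left(C_{ij} - V_j\right)^2}, \qquad
\mathrm{PPED}(C_i, V) = \sqrt{\sum_{j=1}^{P} \left(C_{ij} - V_j\right)^2}, \quad P \ll N
```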

Master Vector
Calculate and compare the similarities between each Ci and V. The smaller the distance from Ci to V (the closer), the higher the chance that the two people have a social connection.

Demo for the case of 2 cells
Points inside the circle have a high number of co-occurrences in each cell and also a more uniform distribution of co-occurrences over the cells.

Social Constraints Satisfied by the Metric
The more cells with co-occurrences, the smaller the PED:
C1(1,1,1,…,1,0,…,0) (k non-zero cells)
C2(1,1,1,…,1,1,0,…,0) (k+1 non-zero cells)

Co-occurrences at the same cell also count toward the similarity:
C1(1,1,1,…,1,0,…,0)
C2(a,1,1,…,1,0,…,0) (a > 1)

Co-occurrences at different cells are weighted more than those at the same cell:
C1(k,0,0,…,0) (k co-occurrences at the same cell)
C2(1,1,1,…,1,0,…,0) (k co-occurrences at different cells)
C1(a,0,0,…,0) (1 cell)
C2(b1,b2,b3,…,bk,0,…,0) (k cells), with sum(bi) = a

How are co-occurrences at the same cell and at different cells related?
C1(x,0,…,0); C2(1,1,…,1,0,…,0) with y non-zero cells. How much of y would make C2 equivalent to C1? The answer is independent of N. (Graph of y against x, with a saturation point m = 20 so that x <= m, compared against the line y = x; the two curves meet at x = 1, y = 1.) The saturation point helps avoid coincidences.

Using PPED to save memory and increase computational efficiency
For the master vector, instead of length N, use P = the maximum number of non-zero cells over the co-occurrence vectors. Hence the size of each co-occurrence vector can be reduced to P, with P << N.
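A sketch of this reduction, under the assumption that only the non-zero counts of each co-occurrence vector are kept (cell indices dropped) and padded to length P; this matches the projected distance sketched earlier but is an assumption, not the slides' exact scheme.

```java
import java.util.Arrays;

class CompactVector {
    // Keep only the non-zero co-occurrence counts, padded with zeros up to P,
    // where P is the maximum number of non-zero cells over all pair vectors.
    static int[] compact(int[] full, int p) {
        int[] out = new int[p];
        int k = 0;
        for (int v : full) {
            if (v != 0 && k < p) {
                out[k++] = v;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] ab = {1, 4, 3, 0, 0, 0, 2};                      // example vector from the slides
        System.out.println(Arrays.toString(compact(ab, 4)));   // prints [1, 4, 3, 2]
    }
}
```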

How to cut off the uninteresting portions for different sets of input data
m is different for different inputs. The number of friends (the potential chance of being friends) fluctuates around an average.

Challenges and Future Work
Crawl Facebook for users and get their check-ins and relationships; not all of this data is available to the public.
Amazon: a user's purchase history is private and protected by Amazon. We could use their comments on products to determine what they bought, but not all users post comments, and it is impossible to know how many times a user has bought a product.
Flickr: photos carry date and coordinate information, but no social connections are available to test against, and not all photos were taken by the user who uploaded them.
The model works well for the owner of the data, or for any third party with access to the data, to advertise to related users.