CS246: Page Selection

Junghoo "John" Cho (UCLA Computer Science)

Page Selection
Infinite # of pages on the Web
– E.g., infinite pages from a calendar site
How to select the pages to download?

Challenges Due to Infinity
What does Web coverage mean?
– 8 billion vs. 20 billion pages
How much should I download?
– 8 billion? 100 billion?
– How much have I covered?
– When can I stop?
How to maximize coverage?
– How can we define coverage?

RankMass
Web coverage weighted by PageRank
Q: Why PageRank?
A:
– Primary ranking metric for search results
– User's visit probability under the random surfer model

PageRank
A page is important if it is pointed to by many important pages
PR(p) = PR(p1)/c1 + … + PR(pk)/ck
– pi: a page pointing to p, ci: number of out-links of pi
PageRank of p is the sum of the PageRanks of its parents, each divided by its out-link count
One equation for every page
– N equations, N unknown variables

PageRank: Random Surfer Model
The probability that a Web surfer reaches a page after many clicks, following random links
[Figure: a surfer making random clicks through the link graph]

Damping Factor and Trust Score
Users do not always follow links
– They get distracted and "jump" to other pages
– d: damping factor, the probability of following a link
– ti: trust score, non-zero only for the pages that the user trusts and jumps to
PR(pi) = d [PR(p1)/c1 + … + PR(pk)/ck] + (1 − d) ti
Cf. "TrustRank", "Personalized PageRank"
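
The damped PageRank equation above can be computed by power iteration. The sketch below is illustrative only: the graph, parameter defaults, and function name are hypothetical, not from the lecture.

```python
# A minimal power-iteration sketch of PageRank with damping factor d and
# trust vector t (hypothetical toy graph; illustrative only).
def pagerank(links, d=0.85, t=None, iters=100):
    """links[i] lists the pages that page i links to (no dangling pages)."""
    n = len(links)
    t = t if t is not None else [1.0 / n] * n   # uniform trust by default
    pr = t[:]                                   # start from the trust vector
    for _ in range(iters):
        new = [(1 - d) * t[i] for i in range(n)]
        for i, outs in enumerate(links):
            for j in outs:                      # follow an out-link with probability d
                new[j] += d * pr[i] / len(outs)
        pr = new
    return pr

# Toy 3-page cycle: 0 -> 1 -> 2 -> 0; by symmetry each page gets PR = 1/3
print(pagerank([[1], [2], [0]]))
```

Each iteration rewrites every page's score as (1 − d)·ti plus d times the mass flowing in over links, exactly the equation on this slide.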

RankMass: Definition
RankMass of a downloaded page set DC:
RM(DC) = Σ_{pi ∈ DC} PR(pi)
– Assuming personalized PageRank
Now what? How can we use it for the crawling problem?

Two Crawling Challenges
Coverage guarantee:
– Given ε, make sure we download pages with RankMass of at least 1 − ε
Crawling efficiency:
– For a given |DC|, pick DC such that RM(DC) is maximized

RankMass Guarantee
Q: How can we provide a RankMass guarantee when we stop?
Q: How do we calculate RankMass without downloading the whole Web?
Q: Any way to provide the guarantee without knowing the exact PageRank?

RankMass Guarantee
We can't compute the exact PageRank, but we can lower-bound it
How? Let's start with a simple case

Single Trusted Page
t1 = 1; ti = 0 (i ≠ 1)
– The surfer always jumps to p1 when bored
NL(p1): pages reachable from p1 within L links

Single Trusted Page
Q: What is the probability of reaching a page L links away from p1?

RankMass Lower Bound: Single Trusted Page
Assuming the trust vector T(1), the sum of the PageRank values of all L-neighbors of p1 is within d^(L+1) of 1:
Σ_{pi ∈ NL(p1)} PR(pi) ≥ 1 − d^(L+1)

PageRank Linearity
Let PR(T) be the PageRank vector computed using trust vector T; that is, PR(T) is the solution of the PageRank equations with the ti taken from T.
Then, for any trust vectors T1, T2 and any α, β ≥ 0 with α + β = 1 (so the combination is a valid trust vector):
PR(αT1 + βT2) = α PR(T1) + β PR(T2)

RankMass Lower Bound: General Case
The RankMass of the L-neighbors of the set G of all trusted pages, NL(G), is within d^(L+1) of 1. That is:
Σ_{pi ∈ NL(G)} PR(pi) ≥ 1 − d^(L+1)
Q: Given this result, how should we download for a RankMass guarantee?
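
This bound can be sanity-checked numerically. The sketch below, on an assumed toy graph (not from the slides), computes personalized PageRank by power iteration, collects the L-neighbors of the trusted set by level-by-level BFS, and checks that their total mass is at least 1 − d^(L+1).

```python
# Sanity check of: sum over N_L(G) of PR(p) >= 1 - d^(L+1), on a toy graph.
def pagerank(links, d, t, iters=200):
    pr = t[:]
    for _ in range(iters):
        new = [(1 - d) * ti for ti in t]
        for i, outs in enumerate(links):
            for j in outs:
                new[j] += d * pr[i] / len(outs)
        pr = new
    return pr

def l_neighbors(links, trusted, L):
    """Pages reachable from the trusted set within L links (BFS by levels)."""
    seen = set(trusted)
    frontier = set(trusted)
    for _ in range(L):
        frontier = {j for i in frontier for j in links[i]} - seen
        seen |= frontier
    return seen

# Toy graph: 0 -> {1,2}, 1 -> 3, 2 -> 3, 3 -> 0, 4 -> 0; trust only page 0
links = [[1, 2], [3], [3], [0], [0]]
d, L = 0.85, 1
pr = pagerank(links, d, [1.0, 0, 0, 0, 0])
mass = sum(pr[i] for i in l_neighbors(links, [0], L))
print(mass >= 1 - d ** (L + 1))  # the theorem guarantees True
```

Here N_1({0}) = {0, 1, 2}; the only mass outside it sits on page 3, which can be reached only through two link hops, so it carries at most d² of the probability.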

The L-Neighbor Crawler
1. L := 0
2. N[0] := { pi | ti > 0 }  // start with the trusted pages
3. While (ε < d^(L+1)):
   a. Download all uncrawled pages in N[L]
   b. N[L+1] := { all pages linked to by a page in N[L] }
   c. L := L + 1
Essentially a BFS (breadth-first search) crawling algorithm
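
The pseudocode above can be turned into a short runnable sketch; here `get_links` is a hypothetical stand-in that simulates fetching a page and returning its out-links from an in-memory web graph.

```python
# A runnable sketch of the L-Neighbor crawler; the web graph is a
# hypothetical in-memory dict standing in for real page fetches.
def l_neighbor_crawl(get_links, trusted, d=0.85, epsilon=0.01):
    crawled = set()
    L = 0
    frontier = set(trusted)                   # N[0]: the trusted pages
    while epsilon < d ** (L + 1):             # stop once d^(L+1) <= epsilon
        crawled |= frontier                   # download all uncrawled pages in N[L]
        frontier = {q for p in frontier for q in get_links(p)}  # N[L+1]
        L += 1
    return crawled

web = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
print(l_neighbor_crawl(lambda p: web.get(p, []), trusted=["a"]))
# BFS from "a" reaches {"a", "b", "c"}; "d" is never linked to, so never crawled
```

The stopping rule downloads exactly enough BFS levels that the undiscovered mass, bounded by d^(L+1), drops below ε.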

Crawling Efficiency
For a given |DC|, pick DC such that RM(DC) is maximized
Q: Can we use L-Neighbor?
A:
– L-Neighbor is simple, but we need to further prioritize certain pages over others
– Page-level prioritization

Page-Level Prioritization
Q: What page should we download first to maximize RankMass?
A: Pages with high PageRank
Q: How do we know which pages have high PageRank?
The idea:
– Calculate a PageRank lower bound for undownloaded pages
– Give high priority to pages with a high lower bound

Calculating the PageRank Lower Bound
PR(p): probability that the random surfer is at p
Break each path down by "interrupts", i.e., jumps to a trusted page
Sum up all paths that start with an interrupt (a jump to a trusted page pj) and end at p
[Figure: an example path from trusted page Pj through P3, P1, P2, P4, P5 to Pi; its probability is (1−d)tj times a factor d·(1/c) for each link followed, e.g., (1−d)(tj)(d·1/3)(d·1/5)(d·1/3)(d·1/4)(d·1/3)]

Calculating the PageRank Lower Bound
Q: What if we sum up the probabilities of only a subset of the paths to p?
A: We get a lower bound on the PageRank of p
Basic idea
– Start with the set of trusted pages G
– Enumerate paths to a page p as we discover links
– Sum up the probability of each discovered path to p
Not every path is needed; only the ones discovered so far count toward the bound

RankMass Crawler: High Level
Dynamically update a lower bound on PageRank
– By enumerating paths to pages
Download the page with the highest lower bound
– Sum of the downloaded pages' lower bounds = guaranteed RankMass coverage

RankMass Crawler
CRM = 0  // CRM: crawled RankMass
rmi = (1 − d)·ti for each pi with ti > 0  // rmi: RankMass (PageRank lower bound) of pi
While (CRM < 1 − ε):
– Pick the pi with the largest rmi
– Download pi if not downloaded yet
– CRM = CRM + rmi  // we have downloaded pi
– For each pj linked to by pi: rmj = rmj + (d/ci)·rmi  // update RankMass based on the discovered links from pi
– rmi = 0
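
A sketch of this algorithm in code, using a priority queue for "pick pi with the largest rmi". The graph, function names, and parameter defaults are hypothetical, chosen only to make the sketch self-contained.

```python
import heapq

# Sketch of the RankMass crawler above. rm[p] is the PageRank lower bound
# (RankMass) accumulated at page p; get_links simulates fetching a page.
def rankmass_crawl(get_links, trust, d=0.85, epsilon=0.05):
    rm = {p: (1 - d) * t for p, t in trust.items() if t > 0}
    heap = [(-v, p) for p, v in rm.items()]
    heapq.heapify(heap)
    crawled, crm = set(), 0.0                 # crm: crawled RankMass so far
    while crm < 1 - epsilon and heap:
        _, p = heapq.heappop(heap)            # page with the largest rm
        v, rm[p] = rm[p], 0.0                 # claim p's accumulated mass
        if v == 0.0:
            continue                          # stale heap entry, skip
        crawled.add(p)                        # download p if not yet crawled
        crm += v
        links = get_links(p)
        for q in links:                       # propagate d/c_i of p's mass
            rm[q] = rm.get(q, 0.0) + d * v / len(links)
            heapq.heappush(heap, (-rm[q], q))
    return crawled, crm

web = {"a": ["b"], "b": ["a"]}
crawled, crm = rankmass_crawl(lambda p: web[p], {"a": 1.0})
# crm is now a guaranteed lower bound on the PageRank mass of `crawled`
```

Already-downloaded pages can be picked again when new paths add to their rm; each pick moves the claimed mass into CRM and forwards a d/ci share along each out-link, mirroring the update rule on the slide.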

Experimental Setup
HTML files only
Algorithms simulated over a web graph crawled between Dec. 2003 and Jan. 2004
[?] million URLs spanning 6.9 million host names and 233 top-level domains

Metrics of Evaluation
1. How much RankMass is actually collected during the crawl
2. How much RankMass is "known" to have been collected during the crawl

L-Neighbor
[Results figure]

RankMass
[Results figure]

Algorithm Efficiency

Algorithm    Downloads for ≥ 0.98 guaranteed RankMass    Downloads for ≥ 0.98 actual RankMass
L-Neighbor   7 million                                   65,000
RankMass     131,072                                     27,939
Optimal      –                                           27,101

Summary
Web crawler and its challenges
Page selection problem
PageRank
RankMass guarantee
Computing the PageRank lower bound
RankMass crawling algorithm
Any questions?