Junghoo “John” Cho UCLA


CS246: PageRank

Problems of TFIDF. Q: Using TFIDF, what pages are likely to be returned for the query “BestBuy”? TFIDF works well on a small controlled corpus, but not on the Web. Do users really want to see pages that contain the word “BestBuy” many times for the query “BestBuy”? TFIDF is easy to spam: ranking is based purely on page content, so authors can manipulate page content to get a high ranking. Q: How can a search engine figure out the pages that users truly have in mind?

$P(R(d)=1 \mid q) = \frac{P(q \mid R(d)=1)\, P(R(d)=1)}{P(q)}$. TFIDF (or the probabilistic model) ignores $P(R(d)=1)$ and focuses only on $P(q \mid R(d)=1)$. But many Web pages share the “same language model” with the query “BestBuy”! To find the “ideal” BestBuy page, we need to know $P(R(d)=1)$, not just $P(q \mid R(d)=1)$. $P(R(d)=1)$: global “popularity” of page $d$, independent of the query. Q: How can we estimate $P(R(d)=1)$? A: Many approaches are possible: collect users’ bookmarks, collect users’ click data, …

Link-Based Ranking. Basic idea: people create a link to a page because they find the page useful, so let us use the “link structure” of the Web to measure a page’s popularity/quality. Example: many pages point to the BestBuy home page with the anchor text “BestBuy”. Q: How can we use the link structure to measure page popularity?

Simple Link Count. Count the number of pages linking to the page. Unfortunately, this does not work well. It is too easy to spam: create many new pages and add links from them to a spam page. Q: Any way to avoid link spamming?
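The weakness of simple link counting can be seen in a minimal sketch. The page names (`bestbuy.com`, `spam.example`, `fake0`, ...) and the helper `link_count` are hypothetical and only serve the illustration.

```python
from collections import Counter

def link_count(links):
    """Count how many distinct pages link to each target.

    `links` is a list of (source, target) pairs.
    """
    distinct = set(links)  # a page linking twice to the same target counts once
    return Counter(target for _source, target in distinct)

# Hypothetical toy Web: two real pages link to "bestbuy.com".
links = [("blog", "bestbuy.com"), ("news", "bestbuy.com")]
# A spammer creates 100 throwaway pages that all link to "spam.example".
links += [(f"fake{i}", "spam.example") for i in range(100)]

counts = link_count(links)
print(counts.most_common(2))
```

Under this measure the spam page outranks the real page, because every link counts the same no matter who created it.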

PageRank. A page is important if it is pointed to by many important pages. $PR(p) = PR(p_1)/c_1 + \dots + PR(p_k)/c_k$, where $p_i$ is a page pointing to $p$ and $c_i$ is the number of links in $p_i$. Division by $c_i$ makes the “matrix” stochastic (more discussion later). The PageRank of $p$ is the sum of the PageRanks of its parents, each divided by its number of links. Q: But the definition is circular! Is the definition well-founded? Is there a solution to the equations? There is one equation for every page: N equations, N unknown variables.

Example: Netflix (Nf), Microsoft (MS) and Amazon (Am). PR(n) = PR(n)/2 + PR(a)/2, PR(m) = PR(a)/2, PR(a) = PR(n)/2 + PR(m). [Figure: Web graph of Nf, MS and Am]
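The three equations determine the scores only up to a common scale factor. As a worked sketch, assume in addition that the scores sum to 1 (a normalization the slide does not state):

```latex
\begin{aligned}
PR(n) &= \tfrac12 PR(n) + \tfrac12 PR(a) &&\Rightarrow\; PR(n) = PR(a) \\
PR(m) &= \tfrac12 PR(a) \\
PR(n) + PR(m) + PR(a) &= 1 &&\Rightarrow\; \tfrac52 PR(a) = 1 \\
PR(a) &= \tfrac25, \quad PR(n) = \tfrac25, \quad PR(m) = \tfrac15
\end{aligned}
```

The remaining equation is consistent with this solution: $PR(n)/2 + PR(m) = \tfrac15 + \tfrac15 = \tfrac25 = PR(a)$.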

PageRank: Matrix Notation. Web graph matrix $M = \{m_{ij}\}$: each page $i$ corresponds to row $i$ and column $i$ of the matrix $M$. $m_{ij} = 1/n$ if page $i$ is one of the $n$ children of page $j$; $m_{ij} = 0$ otherwise. PageRank vector: $p = (p_1, p_2, p_3)^T$. PageRank equation: $p = Mp$. Q: How can we calculate it?
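A minimal sketch of how $M$ could be built from a list of out-links follows. The helper `build_matrix` is hypothetical, and the link structure is inferred from the equations of the Netflix/Microsoft/Amazon example (Nf links to itself and Am, MS links to Am, Am links to Nf and MS).

```python
from fractions import Fraction

def build_matrix(pages, out_links):
    """Return M as a list of rows, with M[i][j] = 1/n if page j links to page i.

    n is the number of children (out-links) of page j.
    """
    index = {page: k for k, page in enumerate(pages)}
    size = len(pages)
    M = [[Fraction(0)] * size for _ in range(size)]
    for parent, children in out_links.items():
        j = index[parent]
        for child in children:
            M[index[child]][j] = Fraction(1, len(children))
    return M

pages = ["Nf", "MS", "Am"]
out_links = {"Nf": ["Nf", "Am"], "MS": ["Am"], "Am": ["Nf", "MS"]}
M = build_matrix(pages, out_links)
for row in M:
    print([str(x) for x in row])
```

Each column of this matrix sums to 1, since every page splits its importance evenly among its children.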

PageRank: Iterative Calculation. Initially assign equal importance $1/N$ to every page. At each iteration, each page shares its importance among its children and receives new importance from its parents. Repeat until the importance of each page converges. Q: Is it guaranteed to converge?
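The iteration can be sketched in a few lines of plain Python. The function name `pagerank_iterate`, the tolerance and the iteration cap are assumptions for illustration. The matrix is the one implied by the Netflix/Microsoft/Amazon example, with pages ordered Nf, MS, Am.

```python
def pagerank_iterate(M, tol=1e-12, max_iter=1000):
    """Repeat p <- M p from the uniform vector until p stops changing."""
    n = len(M)
    p = [1.0 / n] * n  # equal importance 1/N for every page
    for _ in range(max_iter):
        # Page i receives M[i][j] * p[j] from each parent j.
        new_p = [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p

# Column j holds the shares that page j sends to its children.
M = [[0.5, 0.0, 0.5],   # Nf receives from Nf and Am
     [0.0, 0.0, 0.5],   # MS receives from Am
     [0.5, 1.0, 0.0]]   # Am receives from Nf and MS
p = pagerank_iterate(M)
print([round(x, 4) for x in p])
```

On this small graph the values settle near 0.4, 0.2 and 0.4. Convergence is not guaranteed for every graph, which is what the question above is pointing at.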

Example. [Figure: Web graph of Nf, MS and Am]

PageRank as Eigenvector. PageRank equation: $p = Mp$. $p$ is the principal eigenvector of $M$. The principal eigenvalue of a stochastic matrix is 1.
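As a small check of the eigenvector view, the sketch below uses exact fractions to confirm that a candidate vector satisfies $Mp = p$ and that every column of $M$ sums to 1. The matrix is the one implied by the Netflix/Microsoft/Amazon example (order Nf, MS, Am). The candidate vector comes from solving the example equations under the assumption that the scores sum to 1.

```python
from fractions import Fraction as F

M = [[F(1, 2), F(0), F(1, 2)],
     [F(0),    F(0), F(1, 2)],
     [F(1, 2), F(1), F(0)]]
p = [F(2, 5), F(1, 5), F(2, 5)]  # candidate PageRank vector (assumed normalized)

# Multiply M by p and compare with p itself: eigenvalue 1 means M p = p.
Mp = [sum(M[i][j] * p[j] for j in range(3)) for i in range(3)]
# A column-stochastic matrix has columns that each sum to 1.
column_sums = [sum(M[i][j] for i in range(3)) for j in range(3)]

print(Mp == p)
print([str(s) for s in column_sums])
```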

PageRank and Random Surfer Model. PageRank is the probability that a Web surfer reaches the page after many clicks, following random links. [Figure: random click]
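The random surfer model can be simulated directly. The sketch below is illustrative only: `simulate_surfer`, the seed and the number of clicks are assumptions, and the graph is the Netflix/Microsoft/Amazon example inferred from its equations.

```python
import random

def simulate_surfer(out_links, start, clicks, seed=0):
    """Follow random links and return the fraction of clicks landing on each page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(clicks):
        page = rng.choice(out_links[page])  # click a random link on the page
        visits[page] += 1
    return {page: count / clicks for page, count in visits.items()}

out_links = {"Nf": ["Nf", "Am"], "MS": ["Am"], "Am": ["Nf", "MS"]}
freq = simulate_surfer(out_links, start="Nf", clicks=200_000)
print(freq)
```

The visit frequencies should come out close to the solution of the PageRank equations for this graph (about 0.4 for Nf and Am, 0.2 for MS), up to sampling noise.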

Problems on the Real Web. Dead end: a page with no links through which to send its importance; all importance “leaks out of” the Web. Crawler trap: a group of one or more pages that have no links out of the group; the group accumulates all the importance of the Web.

Example: Dead End. No link from Microsoft, so Microsoft is a dead end. [Figure: Web graph of Nf, MS and Am]

Example: Dead End. [Figure: Web graph of Nf, MS and Am] Q: How can we avoid the dead-end problem?
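The leak can be reproduced with a short sketch. It assumes the dead-end example is the Netflix/Microsoft/Amazon graph with Microsoft's out-link removed, so the MS column of the matrix is all zeros (order Nf, MS, Am).

```python
def step(M, p):
    """One iteration p <- M p."""
    n = len(p)
    return [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]

# MS (middle column) has no out-links, so its column is all zeros.
M = [[0.5, 0.0, 0.5],
     [0.0, 0.0, 0.5],
     [0.5, 0.0, 0.0]]

p = [1 / 3, 1 / 3, 1 / 3]
totals = []
for _ in range(50):
    p = step(M, p)
    totals.append(sum(p))  # total importance left in the Web

print(totals[0], totals[-1])
```

Whatever importance reaches MS is lost at the next step, so the total shrinks from 1 toward 0 as the iteration proceeds.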

Solution to Dead End. Option 1: Remove all dead ends. Q: Does it really solve the problem? Option 2: Assume that a surfer jumps to a random page at a dead end. [Figure: Web graph of Nf, MS and Am]
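Option 2 amounts to replacing every all-zero column with a uniform column. A minimal sketch, assuming the dead-end example graph (order Nf, MS, Am, with MS as the dead end) and hypothetical helper names `fix_dead_ends` and `iterate`:

```python
def fix_dead_ends(M):
    """Return a copy of M where each all-zero column becomes uniform 1/N."""
    n = len(M)
    fixed = [row[:] for row in M]
    for j in range(n):
        if all(M[i][j] == 0 for i in range(n)):  # page j has no out-links
            for i in range(n):
                fixed[i][j] = 1.0 / n            # jump to a random page
    return fixed

def iterate(M, rounds=200):
    """Run p <- M p for a fixed number of rounds from the uniform vector."""
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(rounds):
        p = [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]
    return p

M = [[0.5, 0.0, 0.5],
     [0.0, 0.0, 0.5],
     [0.5, 0.0, 0.0]]
p = iterate(fix_dead_ends(M))
print([round(x, 4) for x in p], round(sum(p), 4))
```

With the repaired matrix every column sums to 1 again, so the total importance stays at 1 instead of leaking away.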

Example: Crawler Trap. Only a self-link at Microsoft, so Microsoft is a crawler trap. [Figure: Web graph]

Example: Crawler Trap. [Figure: Web graph of Nf, MS and Am] Q: How can we avoid this problem?
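The effect of a crawler trap can be sketched the same way. The matrix below assumes the Netflix/Microsoft/Amazon graph with Microsoft linking only to itself (order Nf, MS, Am).

```python
def iterate(M, rounds):
    """Run p <- M p for a fixed number of rounds from the uniform vector."""
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(rounds):
        p = [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]
    return p

# MS (middle column) links only to itself: importance enters but never leaves.
M = [[0.5, 0.0, 0.5],
     [0.0, 1.0, 0.5],
     [0.5, 0.0, 0.0]]

p = iterate(M, rounds=100)
print([round(x, 4) for x in p])
```

No importance is lost this time (the columns still sum to 1), but nearly all of it ends up at MS.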

Crawler Trap: Damping Factor. Create an “exit path” in every page: a probability of jumping to a random page. The example assumes a 20% random jump.

Crawler Trap: Damping Factor. Random surfer interpretation: a surfer gets “bored” after a few clicks and randomly jumps to another page. The damping factor makes the graph fully connected, which ensures convergence of the iterative computation method.
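With a random jump, the update becomes $p \leftarrow (1 - j)\,Mp + j/N$, where $j$ is the jump probability. The sketch below uses $j = 0.2$ to match the 20% mentioned on the slides. The function name `pagerank` and the fixed number of rounds are assumptions, and the matrix is the crawler-trap example (order Nf, MS, Am, with MS linking only to itself).

```python
def pagerank(M, jump=0.2, rounds=200):
    """Iterate p <- (1 - jump) * M p + jump / N from the uniform vector."""
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(rounds):
        p = [(1 - jump) * sum(M[i][j] * p[j] for j in range(n)) + jump / n
             for i in range(n)]
    return p

M = [[0.5, 0.0, 0.5],
     [0.0, 1.0, 0.5],   # MS keeps its own importance (self-link only)
     [0.5, 0.0, 0.0]]

p = pagerank(M)
print([round(x, 4) for x in p])
```

MS still receives the largest score, but Nf and Am keep a non-zero share because every page now has an exit path.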

Link-Spam Problem. Q: What if a spammer creates a lot of pages that all link to a single spam page? PageRank is better than simple link count, but still vulnerable to link spam. Q: Any way to avoid link spam?
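The vulnerability can be illustrated with a small experiment. This is a sketch under several assumptions: the page names, the helper `web_with_farm`, the 20% jump probability, and the choice to treat a page without out-links as jumping to a random page are all illustrative, not taken from the slides.

```python
def pagerank(out_links, jump=0.2, rounds=100):
    """PageRank with random jumps; pages without out-links jump to a random page."""
    pages = list(out_links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(rounds):
        new_rank = {page: jump / n for page in pages}
        for page, children in out_links.items():
            targets = children or pages  # dead end: spread over all pages
            share = (1 - jump) * rank[page] / len(targets)
            for child in targets:
                new_rank[child] += share
        rank = new_rank
    return rank

def web_with_farm(farm_size):
    """Three real pages plus a spam page fed by `farm_size` spammer-made pages."""
    web = {"Nf": ["Nf", "Am"], "MS": ["Am"], "Am": ["Nf", "MS"], "spam": []}
    for i in range(farm_size):
        web[f"farm{i}"] = ["spam"]  # each farm page links only to the spam page
    return web

before = pagerank(web_with_farm(0))
after = pagerank(web_with_farm(50))
print(round(before["spam"], 4), round(after["spam"], 4))
```

Even though no real page links to the spam page, its score rises sharply once the farm exists, because every farm page receives some importance from random jumps and passes it on.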

TrustRank [Gyongyi et al. 2004]. Good pages don’t point to spam pages. Trust a page only if it is linked to by pages you trust. Same as PageRank except for the random-jump probability term: the random jump goes only to trusted pages.
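A minimal sketch of this idea changes only where the random jump lands. The function `trustrank`, the choice of `Nf` as the single trusted page, the page names and the 20% jump probability are hypothetical. Sending the importance of pages without out-links back to the trusted pages is also an assumption of this sketch, not necessarily the published formulation.

```python
def trustrank(out_links, trusted, jump=0.2, rounds=100):
    """PageRank-style iteration where random jumps land only on trusted pages."""
    pages = list(out_links)
    seeds = [page for page in pages if page in trusted]
    # Start with all importance on the trusted pages.
    rank = {page: (1.0 / len(seeds) if page in trusted else 0.0) for page in pages}
    for _ in range(rounds):
        new_rank = {page: 0.0 for page in pages}
        for seed in seeds:
            new_rank[seed] += jump / len(seeds)  # jump goes to trusted pages only
        for page, children in out_links.items():
            targets = children or seeds  # assumption: dead ends jump to trusted pages
            share = (1 - jump) * rank[page] / len(targets)
            for child in targets:
                new_rank[child] += share
        rank = new_rank
    return rank

web = {"Nf": ["Nf", "Am"], "MS": ["Am"], "Am": ["Nf", "MS"], "spam": []}
for i in range(50):
    web[f"farm{i}"] = ["spam"]  # spammer-made pages linking to the spam page

rank = trustrank(web, trusted={"Nf"})
print(rank["spam"], round(rank["Am"], 4))
```

Because no trusted page links into the farm, directly or indirectly, the spam page and all farm pages keep a score of exactly 0 no matter how many farm pages are added.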

TrustRank: Theory [Bianchini et al. 2005]. Consider a set of pages S. [Figure: S together with IN(S), OUT(S) and DP(S)]

TrustRank: Theory [Bianchini et al. 2005]

What Does It Mean? $P_S = B_S + c\,P_{IN} - c\,P_{OUT} - c\,P_{DP}$. Note: $P_S = 0$ if $B_S = 0$ and $P_{IN} = 0$. You cannot improve your TrustRank simply by creating more pages and linking among them yourself. To get a non-zero TrustRank, you need to be either trusted or linked to from outside.

Is TrustRank the Ultimate Solution? Not really… Honeypot: a page with good content and hidden links to spam pages; good users link to the honeypot because of its quality content. Blogs, forums, wikis, mailing lists: it is easy to add spam links to them. Link exchange: a set of sites exchanging links to boost their ranking. A never-ending rat race…

References
[Gyongyi et al. 2004] Z. Gyöngyi, H. Garcia-Molina, J. Pedersen: Combating Web Spam with TrustRank, VLDB Conference 2004
[Bianchini et al. 2005] M. Bianchini, M. Gori, F. Scarselli: Inside PageRank, ACM Transactions on Internet Technology 5(1), February 2005