CS246 Link-Based Ranking

Problems of TFIDF Vector
- Works well on a small, controlled corpus, but not on the Web
  - Top result for the query “American Airlines”: an accident report on American Airlines flights
  - Do users really care how many times “American Airlines” is mentioned?
- Easy to spam
  - Ranking is based purely on page content
  - Authors can manipulate page content to get a high ranking
- Any ideas?

Link-based Ranking
- People “expect” to get the AA home page for the query “American Airlines”
- Many pages point to the AA home page, but not to the accident report
- Use link count!

Simple Link Count
- Still easy to spam
  - Create many pages and add links from them to the target page
- How to avoid spam?

PageRank
- A page is important if it is pointed to by many important pages
- $PR(p) = PR(p_1)/n_1 + \dots + PR(p_k)/n_k$
  - $p_i$: a page pointing to $p$; $n_i$: number of links in $p_i$
- The PageRank of $p$ is the sum of the PageRanks of its parents, each divided by that parent's number of links
- One equation for every page
  - $N$ equations, $N$ unknown variables

Example: Web of 1842
- Pages: Netscape (Ne, n), Microsoft (MS, m), and Amazon (Am, a)
- $PR(n) = PR(n)/2 + PR(a)/2$
- $PR(m) = PR(a)/2$
- $PR(a) = PR(n)/2 + PR(m)$

PageRank: Matrix Notation
- Web graph matrix $M = \{m_{ij}\}$
  - Each page $i$ corresponds to row $i$ and column $i$ of the matrix $M$
  - $m_{ij} = 1/n$ if page $i$ is one of the $n$ children of page $j$; $m_{ij} = 0$ otherwise
- PageRank vector: $\vec{r} = [PR(p_1), \dots, PR(p_N)]^T$
- PageRank equation: $\vec{r} = M \vec{r}$
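As an illustration of the matrix notation, here is a minimal Python sketch that builds $M$ for the three-page example. The link lists are read off the example's equations (Netscape links to itself and Amazon, Microsoft to Amazon, Amazon to Netscape and Microsoft); the names `links`, `pages`, and `build_matrix` are hypothetical and chosen only for this sketch.

```python
# Link structure inferred from the example equations; page order is [n, m, a].
links = {
    "n": ["n", "a"],  # Netscape links to itself and to Amazon
    "m": ["a"],       # Microsoft links to Amazon
    "a": ["n", "m"],  # Amazon links to Netscape and Microsoft
}
pages = ["n", "m", "a"]


def build_matrix(links, pages):
    """Return M with M[i][j] = 1/n_j if page j links to page i, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    size = len(pages)
    M = [[0.0] * size for _ in range(size)]
    for parent, children in links.items():
        j = index[parent]
        for child in children:
            M[index[child]][j] = 1.0 / len(children)
    return M


M = build_matrix(links, pages)
for row in M:
    print(row)
```

Each column describes how one page splits its importance among its children, so every column of a page with at least one link sums to 1.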

PageRank: Iterative Computation
- Initially, every page has one unit of importance
- At each round, each page shares its importance among its children and receives new importance from its parents
- Eventually, the importance of each page reaches a limit
- Stochastic matrix: each column of $M$ sums to 1, so the total importance is preserved from round to round

Example: Web of 1842
[Figure: graph of Ne, Am, and MS]
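The iterative computation can be sketched in a few lines of Python. The matrix below is the one implied by the example equations, with pages ordered [n, m, a]; the function names `step` and `pagerank` and the round count are assumptions made for this sketch.

```python
# Column-stochastic matrix for the three-page example, page order [n, m, a].
M = [[0.5, 0.0, 0.5],
     [0.0, 0.0, 0.5],
     [0.5, 1.0, 0.0]]


def step(M, r):
    """One round: every page receives importance from its parents."""
    size = len(r)
    return [sum(M[i][j] * r[j] for j in range(size)) for i in range(size)]


def pagerank(M, rounds=100):
    """Start with one unit of importance per page and iterate."""
    r = [1.0] * len(M)
    for _ in range(rounds):
        r = step(M, r)
    return r


r = pagerank(M)
print([round(x, 4) for x in r])
```

Under these assumptions the values settle near 1.2 for Netscape, 0.6 for Microsoft, and 1.2 for Amazon, and the total stays at 3 units throughout.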

PageRank: Eigenvector
- PageRank equation: $\vec{r} = M \vec{r}$, where $\vec{r}$ is the PageRank vector
- $\vec{r}$ is the principal eigenvector of $M$
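A quick way to see the eigenvector property is to check that $M \vec{r} = \vec{r}$ holds exactly for the three-page example. The sketch below uses exact fractions and assumes the page order [n, m, a] and a total of 3 units of importance; the candidate vector (6/5, 3/5, 6/5) comes from solving the example equations by hand.

```python
from fractions import Fraction as F

# Matrix for the three-page example, page order [n, m, a].
M = [[F(1, 2), F(0), F(1, 2)],
     [F(0), F(0), F(1, 2)],
     [F(1, 2), F(1), F(0)]]

# Candidate PageRank vector, scaled so that the total importance is 3.
r = [F(6, 5), F(3, 5), F(6, 5)]

# Multiply M by r and compare: an eigenvector with eigenvalue 1 is unchanged.
Mr = [sum(M[i][j] * r[j] for j in range(3)) for i in range(3)]
is_fixed_point = (Mr == r)
print(is_fixed_point)
```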

PageRank: Random Surfer Model
- The PageRank of a page is the probability that a Web surfer is at that page after many clicks, following random links
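The random surfer model can be simulated directly. The sketch below is an illustration only: it uses the link structure inferred from the three-page example, a fixed random seed, and a hypothetical function `surf` that counts how often each page is visited.

```python
import random

# Link structure inferred from the three-page example.
links = {"n": ["n", "a"], "m": ["a"], "a": ["n", "m"]}


def surf(links, clicks, seed=0):
    """Follow random links and return the fraction of visits per page."""
    rng = random.Random(seed)
    page = "n"
    visits = {p: 0 for p in links}
    for _ in range(clicks):
        page = rng.choice(links[page])  # click a random link on the current page
        visits[page] += 1
    return {p: count / clicks for p, count in visits.items()}


freq = surf(links, 200_000)
print(freq)
```

The visit fractions should come out close to 0.4, 0.2, and 0.4 for n, m, and a, which is the PageRank vector (1.2, 0.6, 1.2) divided by the total of 3.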

Problems on the Real Web
- Dead end
  - A page with no links to send its importance to
  - All importance “leaks out of” the Web
- Crawler trap
  - A group of one or more pages that have no links out of the group
  - Accumulates all the importance of the Web

Example: Dead End
- No link from Microsoft: MS is a dead end
[Figure: graph of Ne, Am, and MS]

Example: Dead End
[Figure: graph of Ne, Am, and MS]
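To see the leak numerically, the sketch below removes Microsoft's only link from the three-page example, so its column in the matrix is all zeros, and tracks the total importance per round. The page order [n, m, a] and the names `step` and `totals` are assumptions of this sketch.

```python
# Microsoft (middle column) has no links, so its column is all zeros.
M = [[0.5, 0.0, 0.5],
     [0.0, 0.0, 0.5],
     [0.5, 0.0, 0.0]]


def step(M, r):
    """One round of importance sharing."""
    size = len(r)
    return [sum(M[i][j] * r[j] for j in range(size)) for i in range(size)]


r = [1.0, 1.0, 1.0]
totals = []
for _ in range(100):
    r = step(M, r)
    totals.append(sum(r))  # importance still left in the Web

print(totals[:5], totals[-1])
```

The total drops from 3 to 2 after the first round and keeps shrinking toward 0, because whatever reaches Microsoft is never passed on.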

Solution to Dead End
- Assume that a surfer jumps to a random page at a dead end
[Figure: graph of Ne, Am, and MS]
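One way to express this fix in code is to replace every all-zero column with a uniform column, so a surfer at a dead end moves to any page with equal probability. This is a sketch under that reading; `patch_dead_ends` is a hypothetical helper and the page order is [n, m, a].

```python
# Dead-end example: Microsoft (middle column) has no links.
M = [[0.5, 0.0, 0.5],
     [0.0, 0.0, 0.5],
     [0.5, 0.0, 0.0]]


def patch_dead_ends(M):
    """Replace each all-zero column with a uniform jump to every page."""
    size = len(M)
    patched = [row[:] for row in M]
    for j in range(size):
        if all(M[i][j] == 0.0 for i in range(size)):
            for i in range(size):
                patched[i][j] = 1.0 / size
    return patched


def step(M, r):
    size = len(r)
    return [sum(M[i][j] * r[j] for j in range(size)) for i in range(size)]


P = patch_dead_ends(M)
r = [1.0, 1.0, 1.0]
for _ in range(100):
    r = step(P, r)

print([round(x, 4) for x in r], round(sum(r), 4))
```

With the patched matrix the total importance stays at 3, and the values settle near 18/13, 9/13, and 12/13 for n, m, and a.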

Example: Crawler Trap
- Only a self-link at Microsoft: MS is a crawler trap
[Figure: graph of Ne, Am, and MS]

Example: Crawler Trap
[Figure: graph of Ne, Am, and MS]

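Crawler Trap: Damping Factor
- “Tax” each page some fraction of its importance and distribute it equally among all pages
  - The tax rate is the probability that the surfer jumps to a random page
- Assuming a 20% tax on the three-page example: $\vec{r} = 0.8\, M \vec{r} + 0.2\, [1, 1, 1]^T$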
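The effect of the tax can be checked on the crawler-trap example, where Microsoft links only to itself. The sketch below assumes one unit of importance per page, so a 20% tax redistributes 0.2 units to each page per round; `taxed_step` is a hypothetical name and the page order is [n, m, a].

```python
# Crawler-trap example: Microsoft (middle column) links only to itself.
M = [[0.5, 0.0, 0.5],
     [0.0, 1.0, 0.5],
     [0.5, 0.0, 0.0]]


def taxed_step(M, r, tax=0.2):
    """Keep (1 - tax) of the link-based importance, then give every page
    an equal share of the collected tax (tax units each, assuming the
    total importance equals the number of pages)."""
    size = len(r)
    return [(1.0 - tax) * sum(M[i][j] * r[j] for j in range(size)) + tax
            for i in range(size)]


r = [1.0, 1.0, 1.0]
for _ in range(200):
    r = taxed_step(M, r)

print([round(x, 4) for x in r])
```

Without the tax, Microsoft would end up with all 3 units. With the 20% tax, the values settle near 7/11, 21/11, and 5/11 for n, m, and a: Microsoft still ranks highest, but the other pages keep some importance.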

Link Spam Problem
- Q: What if a spammer creates a lot of pages and links them all to a single spam page?
- PageRank is better than a simple link count, but still vulnerable to link spam
- Q: Any way to avoid link spam?

TrustRank [Gyongyi et al. 2004]
- Good pages don’t point to spam pages
- Trust a page only if it is linked to by pages you trust
- Same as PageRank except for the random-jump probability term: the random jump goes only to trusted pages
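The difference in the random-jump term can be illustrated on a toy graph. Everything in the sketch below is hypothetical: a trusted page `t` and a good page `g` link to each other, while two spam pages `s1` and `s2` link only to each other. The function `rank` is a simplified sketch, not the exact formulation of the cited paper; it takes the jump distribution as a parameter so the same code covers both a uniform jump and a jump to trusted pages only.

```python
# Hypothetical four-page graph, page order [t, g, s1, s2].
# t <-> g link to each other; s1 <-> s2 form an isolated spam farm.
M = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]


def rank(M, jump, tax=0.2, rounds=200):
    """Iterate r = (1 - tax) * M r + tax * jump, starting from jump."""
    size = len(jump)
    r = jump[:]
    for _ in range(rounds):
        r = [(1.0 - tax) * sum(M[i][j] * r[j] for j in range(size))
             + tax * jump[i] for i in range(size)]
    return r


uniform_jump = [0.25, 0.25, 0.25, 0.25]   # jump to any page
trusted_jump = [1.0, 0.0, 0.0, 0.0]       # jump only to the trusted page t

pagerank = rank(M, uniform_jump)
trustrank = rank(M, trusted_jump)
print(pagerank)
print(trustrank)
```

With the uniform jump the spam farm collects half of the total score just by linking to itself. With the trusted jump its score stays at exactly zero, because it is neither trusted nor linked to from outside.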

TrustRank: Theory [Bianchini et al. 2005]
- Consider a set of pages $S$
[Figure: $S$ with the related sets IN($S$), OUT($S$), and DP($S$)]

TrustRank: Theory [Bianchini et al. 2005]

What Does It Mean?
- $P_S = 0$ if $B_S = 0$ and $P_{IN} = 0$
- You cannot improve your TrustRank simply by creating more pages and linking among them
- To get a non-zero TrustRank, you need to either be trusted or get links from outside

Is TrustRank the Ultimate Solution?
- Not really…
- Honeypot: a page with good content and hidden links to spam pages
  - Good users link to the honeypot because of its quality content
- Blogs, forums, wikis, mailing lists
  - Easy to add spam links
- Link exchange
  - A set of sites exchanging links to boost their rankings
- A never-ending rat race…

Anti-Spamming at Search Engines
- Anchor text
  - Consider what others think about your page
  - Give higher weights to anchors from high-PageRank pages
  - More difficult to spam
- TrustRank
  - To gain importance, you need to convince many pages under others’ control, or convince the search engines
  - More difficult to spam
- Give inter-site links a higher weight

Hub and Authority
- More detailed evaluation of importance
- A page is useful if
  - it has good content, or
  - it has links to useful pages (a good bookmark)
- Hub/Authority
  - Authority: a page with good content
  - Hub: a page pointing to pages with good content

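Hub/Authority: Definition
- Recursive definition, similar to PageRank
  - Authority pages are linked to by many hub pages
  - Hub pages link to many authority pages
- $H(p) = A(p_1) + \dots + A(p_k)$, where $p_1, \dots, p_k$ are the pages that $p$ points to
- $A(p) = H(p_1) + \dots + H(p_m)$, where $p_1, \dots, p_m$ are the pages that point to $p$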

Hub/Authority: Matrix Notation
- Web graph matrix $A = \{a_{ij}\}$
  - Each page $i$ corresponds to row $i$ and column $i$ of the matrix $A$
  - $a_{ij} = 1$ if page $i$ points to page $j$; $a_{ij} = 0$ otherwise
- $A$ is not a stochastic matrix
- $A^T$: similar to the PageRank matrix $M$, without the stochastic restriction

Example: Web of 1842
[Figure: graph of Ne, Am, and MS; vectors are written in the order [n, m, a]]

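Hub/Authority: Iterative Computation
- Hub vector $\vec{h}$ and authority vector $\vec{a}$
  - $\vec{h} = \lambda A \vec{a}$, where $\lambda$ is a scaling factor that prevents divergence
  - $\vec{a} = \mu A^T \vec{h}$, where $\mu$ is a scaling factor that prevents divergence
- Compute $\vec{h}$ and $\vec{a}$ iteratively, with scaling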

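Hub/Authority: Eigenvector
- With scaling factors $\lambda$ and $\mu$ for the hub vector $\vec{h}$ and the authority vector $\vec{a}$:
  - $\vec{h} = \lambda \mu A A^T \vec{h}$, so $\vec{h}$ is an eigenvector of $A A^T$
  - $\vec{a} = \lambda \mu A^T A \vec{a}$, so $\vec{a}$ is an eigenvector of $A^T A$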

Example: Web of 1842
[Figure: graph of Ne, Am, and MS]
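The hub/authority iteration can be sketched as follows. The adjacency matrix assumes the same link structure as the PageRank example (Netscape links to itself and Amazon, Microsoft to Amazon, Amazon to Netscape and Microsoft), which may differ from the graph on the original slide. Scaling by the largest entry is one possible choice of scaling factor, and the names `scale` and `hits` are hypothetical.

```python
# Adjacency matrix, page order [n, m, a]: A[i][j] = 1 if page i points to page j.
A = [[1, 0, 1],
     [0, 0, 1],
     [1, 1, 0]]


def scale(v):
    """Divide by the largest entry so the values do not diverge."""
    top = max(v)
    return [x / top for x in v]


def hits(A, rounds=50):
    """Alternate h = A a and a = A^T h, scaling after each update."""
    size = len(A)
    hub = [1.0] * size
    auth = [1.0] * size
    for _ in range(rounds):
        # Hub score: sum of authority scores of the pages this page points to.
        hub = scale([sum(A[i][j] * auth[j] for j in range(size))
                     for i in range(size)])
        # Authority score: sum of hub scores of the pages pointing to this page.
        auth = scale([sum(A[i][j] * hub[i] for i in range(size))
                      for j in range(size)])
    return hub, auth


hub, auth = hits(A)
print([round(x, 4) for x in hub])
print([round(x, 4) for x in auth])
```

After enough rounds another update no longer changes the scaled vectors, which is the fixed-point behavior the eigenvector view describes.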

Hub/Authority and Root Set
- Apply the equations on a small neighborhood graph (base set)
  - Start with, say, 100 pages on “bicycling” (root set)
  - Add pages pointing to the 100 pages
  - Add pages that the 100 pages point to
- The pages identified are good “Hubs” and “Authorities” on “bicycling”
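The expansion from root set to base set can be sketched as a small function. The graph, the page names, and the helper `base_set` below are hypothetical; a real system would get the root set from a text search and the links from a crawl.

```python
def base_set(root, links):
    """Expand a root set with the pages it points to and the pages pointing to it."""
    root = set(root)
    base = set(root)
    for page, targets in links.items():
        if page in root:
            base.update(targets)            # pages that the root set points to
        elif root.intersection(targets):
            base.add(page)                  # pages pointing into the root set
    return base


# Hypothetical link graph: page -> pages it links to.
links = {
    "r1": ["x"],
    "r2": ["r1"],
    "y": ["r2"],
    "z": ["w"],   # unrelated to the root set, so z and w are left out
}

print(sorted(base_set(["r1", "r2"], links)))
```

The hub/authority equations are then applied only to the links among the pages in the returned set.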

Hub/Authority and Web Community
- Hub/Authority is often used to identify Web communities
  - Gives a nice notion of the “Hubs” and “Authorities” of a community
  - Hubs and Authorities are often tightly linked to each other

Any Questions?

Questions
- Can we apply Hub/Authority to the entire Web, like PageRank?

Hub/Authority on the Entire Web?
- Hub/Authority works well on a topic-specific subset, but poorly on the whole Web
- Easy to spam
  1. Create a page pointing to many authority pages (e.g., Yahoo, Google, etc.)
     - The page becomes a good hub page
  2. On that page, add a link to your home page

Questions
- Can we apply PageRank to a small base set?

PageRank on a Small Subset
- In general, PageRank works better on larger datasets
- We may be able to compute a “topic-specific” PageRank
- Any other way to get a “topic-specific” PageRank?

Summary: Link-Based Ranking
- PageRank
- TrustRank variation
- Hub/Authority