How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-

Presentation transcript:

How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License

Link Analysis Content analysis is useful, but combining it with link analysis allows us to rank pages much more successfully 2 popular methods – Sergey Brin and Larry Page’s PageRank – Jon Kleinberg’s Hyperlink-Induced Topic Search (HITS)

What Does a Link Mean? (Figure: page A links to page B.) Possible meanings: A recommends B; A specifically does not recommend B; B is an authoritative reference for something in A; A and B are about the same thing (topic locality)

PageRank Developed by Brin and Page (Google) while Ph.D. students at Stanford Links are a recommendation system – The more links that point to you, the more important you are – Inlinks from important pages are weightier than inlinks from unimportant pages – The more outlinks you have, the less weight your links carry Page et al., The PageRank citation ranking: Bringing order to the web, 1998

Random Surfer Model Model helpful for understanding PageRank The Random Surfer starts at a randomly chosen page and selects a link at random to follow PageRank of a page reflects the probability that the surfer lands on that page

Example of Random Surfer Start at: B ¼ probability of going to A ¼ probability of going to C ¼ probability of going to D ¼ probability of going to E Choose: E ½ probability of going to C ½ probability of going to D

Problem 1: Dangling Node What if we go to A? We’re stuck at a dead end! Solution: Teleport to any other page at random (Figure: link graph of pages A, B, C, D, E, in which A has no outlinks.)

Problem 2: Infinite Loop What if we get stuck in a cycle? Solution: Teleport to any other page at random (Figure: pages A, X, and Y linked in a cycle.)

Rank Sinks Dangling nodes and cycles are called rank sinks Solution is to add a teleportation probability α to every decision: with probability α the surfer gets bored and jumps somewhere else, and with probability 1 − α the surfer chooses one of the available links α = 0.15 is typical
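
The random surfer with teleportation can be simulated directly. The following is a minimal sketch, not part of the original slides: the link graph is the one used in the PageRank example slides (B links to A, C, D, E; C links to E; D links to B; E links to C and D; A is a dangling node), while the function name `random_surfer`, the step count, and the fixed seed are illustrative assumptions. It also assumes a teleport may land on any page, including the current one.

```python
import random

# Link graph from the example slides; A has no outlinks (dangling node).
LINKS = {
    "A": [],
    "B": ["A", "C", "D", "E"],
    "C": ["E"],
    "D": ["B"],
    "E": ["C", "D"],
}

def random_surfer(links, alpha=0.15, steps=100_000, seed=0):
    """Return the fraction of steps the surfer spends on each page."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)          # start at a randomly chosen page
    for _ in range(steps):
        visits[current] += 1
        outlinks = links[current]
        # Teleport with probability alpha, or always at a dead end.
        # Assumption: the teleport target is uniform over all pages.
        if not outlinks or rng.random() < alpha:
            current = rng.choice(pages)
        else:
            current = rng.choice(outlinks)
    return {page: count / steps for page, count in visits.items()}

for page, share in random_surfer(LINKS).items():
    print(page, round(share, 3))
```

The visit fractions approximate the probability that the surfer lands on each page, which is the quantity PageRank is meant to capture.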

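PageRank Definition
$$PR(P_i) = \frac{\alpha}{|P|} + (1 - \alpha) \sum_{P_j \in B_{P_i}} \frac{PR(P_j)}{|P_j|}$$
where $PR(P_i)$ is the PageRank of page $P_i$, $B_{P_i}$ is the set of all pages pointing to $P_i$, the sum runs over the PR of all pages pointing to $P_i$, $|P_j|$ is the number of outlinks from $P_j$, $\alpha$ is the teleportation probability, and $|P|$ is the total number of pages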

PageRank Example PR(C) = .15/5 + .85 × (PR(B)/4 + PR(E)/2) Problem: What are PR(B) and PR(E)? Solution: Give all pages the same PR to start (1/|P|) and iteratively calculate new PR

PageRank Example
PR(A) = .15/5 + .85 × PR(B)/4 = .0725
PR(B) = .15/5 + .85 × PR(D)/1 = .2
PR(C) = .15/5 + .85 × (PR(B)/4 + PR(E)/2) = .15/5 + .85 × (.2/4 + .2/2) = .1575
PR(D) = .15/5 + .85 × (PR(B)/4 + PR(E)/2) = .1575
PR(E) = .15/5 + .85 × (PR(B)/4 + PR(C)/1) = .2425
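
The first round of this calculation can be reproduced in a few lines. This is a sketch under stated assumptions rather than the slides' own code: the function names `pagerank_step` and `pagerank` are made up for illustration, and the update follows the slide formula literally, so the rank held by the dangling page A is not redistributed and the values need not sum to 1 after the first round.

```python
# Link graph implied by the example: B -> A, C, D, E; C -> E; D -> B; E -> C, D.
LINKS = {
    "A": [],
    "B": ["A", "C", "D", "E"],
    "C": ["E"],
    "D": ["B"],
    "E": ["C", "D"],
}

def pagerank_step(links, ranks, alpha=0.15):
    """Apply one update: PR(p) = alpha/|P| + (1 - alpha) * sum(PR(q)/outlinks(q))."""
    n = len(links)
    new_ranks = {}
    for page in links:
        # Sum the shares passed along by every page that links to `page`.
        inflow = sum(ranks[src] / len(outs)
                     for src, outs in links.items() if page in outs)
        new_ranks[page] = alpha / n + (1 - alpha) * inflow
    return new_ranks

def pagerank(links, alpha=0.15, iterations=20):
    """Start every page at 1/|P| and repeat the update a fixed number of times."""
    ranks = {page: 1 / len(links) for page in links}
    for _ in range(iterations):
        ranks = pagerank_step(links, ranks, alpha)
    return ranks

# One step from the uniform start (every page at 1/5 = 0.2).
first = pagerank_step(LINKS, {page: 0.2 for page in LINKS})
print({page: round(value, 4) for page, value in first.items()})
```

Calling `pagerank(LINKS)` repeats the same update for 20 rounds.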

Calculating PageRank PageRank is recomputed iteratively until the values converge, after around 20 iterations Can also be calculated efficiently using matrix multiplication

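PageRank Definition as Matrix Stated as a matrix equation, where R is the vector of PageRank values and T is the matrix of transition probabilities:
$$R = T R$$
where $T_{ij}$ is the probability of going from page j to page i:
$$T_{ij} = \begin{cases} \dfrac{\alpha}{|P|} + \dfrac{1 - \alpha}{|P_j|} & \text{if a link from page } j \text{ to } i \text{ exists} \\ \dfrac{\alpha}{|P|} & \text{otherwise} \end{cases}$$
with $|P|$ the total number of pages and $|P_j|$ the total number of outlinks from page j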

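PageRank Matrix Example (Figure: the vector (PR_A, PR_B, PR_C, PR_D, PR_E) is computed as the transition matrix T times the vector of initial PR values, each 1/|P|.) An entry for a link from a page with 4 outlinks is 0.15/5 + (1 – 0.15)/4 = 0.2425 There is no link from C to A, so that entry is 0.15/5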
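
The matrix form can be sketched in plain Python. The helper names `transition_matrix` and `mat_vec` are illustrative assumptions, and the link graph is the five-page example (B links to A, C, D, E; C links to E; D links to B; E links to C and D). Each entry is 0.15/5 plus (1 – 0.15) divided by the source page's outlink count when a link exists, and 0.15/5 otherwise.

```python
PAGES = ["A", "B", "C", "D", "E"]
LINKS = {
    "A": [],
    "B": ["A", "C", "D", "E"],
    "C": ["E"],
    "D": ["B"],
    "E": ["C", "D"],
}

def transition_matrix(pages, links, alpha=0.15):
    """Build T where T[i][j] is the probability of going from page j to page i."""
    n = len(pages)
    matrix = []
    for target in pages:            # row index i
        row = []
        for source in pages:        # column index j
            if target in links[source]:
                row.append(alpha / n + (1 - alpha) / len(links[source]))
            else:
                row.append(alpha / n)
        matrix.append(row)
    return matrix

def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

T = transition_matrix(PAGES, LINKS)
initial = [1 / len(PAGES)] * len(PAGES)     # every page starts at 1/|P|
ranks = mat_vec(T, initial)
print([round(value, 4) for value in ranks])
```

With the uniform starting vector, which sums to 1, one multiplication gives the same values as one round of the per-page formula.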

PageRank Issues Rich-get-richer phenomenon – May be difficult for new pages with few inlinks to compete with older, highly linked pages with high PageRank – Could promote a small fraction of new pages at random¹ or add a decay factor to links A study² showed that just counting the number of inlinks gives a ranking similar to PageRank – Study was on a small scale and pages were not necessarily “typical” – Counting inlinks is more susceptible to spamming ¹ Pandey et al., Shuffling a stacked deck, VLDB ² Amento et al., Does “authority” mean quality?

HITS Hyperlink-induced topic search (HITS) by Kleinberg¹ Hub: page with outlinks to informative web pages Authority: informative/authoritative page with many inlinks Recursive definition: – Good hubs point to good authorities – Good authorities are pointed to by good hubs ¹ Kleinberg, Authoritative sources in a hyperlinked environment, J. ACM, 1999

Root and Base Sets (Figure labels: a root set of pages A, B, C and a base set that also includes pages D, E, F, G, H, I.) The resulting subgraph has between 1K and 5K pages

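Calculate H and A
$$A(p) = \sum_{q \,:\, e_{qp} \in E} H(q) \qquad \text{(inlinks)}$$
$$H(q) = \sum_{p \,:\, e_{qp} \in E} A(p) \qquad \text{(outlinks)}$$
where $E$ is the set of all directed edges in the subgraph and $e_{qp}$ is the edge from page q to p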

Calculating H and A H and A scores are recomputed repeatedly until they converge Can also be calculated efficiently using matrix multiplication
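
The repeated hub and authority updates might look like the sketch below. The edge list is a hypothetical four-page subgraph, not the one from the slides, and the names `hits` and `normalize` are illustrative. Rescaling the scores to unit length after each round is an added assumption to keep the numbers bounded; the slides do not say how scores are scaled.

```python
import math

# Hypothetical subgraph: each pair (q, p) is a link from page q to page p.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "C"), ("C", "A")]

def normalize(scores):
    """Scale scores to unit Euclidean length (assumption: L2 normalization)."""
    norm = math.sqrt(sum(value * value for value in scores.values()))
    return {page: (value / norm if norm else 0.0)
            for page, value in scores.items()}

def hits(edges, iterations=20):
    """Return (hub, authority) score dictionaries after repeated updates."""
    pages = sorted({page for edge in edges for page in edge})
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # Authority of p: sum of hub scores of the pages linking to p (inlinks).
        auth = {p: sum(hub[q] for q, target in edges if target == p)
                for p in pages}
        # Hub of q: sum of authority scores of the pages q links to (outlinks).
        hub = {q: sum(auth[p] for source, p in edges if source == q)
               for q in pages}
        auth, hub = normalize(auth), normalize(hub)
    return hub, auth

hub, auth = hits(EDGES)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

In this toy graph page C collects the most inlinks from well-connected pages, and page A links to the strongest authorities.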

Example Subgraph (Figure: a subgraph of four pages A, B, C, D and its adjacency matrix, with rows and columns labeled A, B, C, D.)

Auth Scores (Figure: the vector of authority scores A_A, A_B, A_C, A_D equals the transposed adjacency matrix multiplied by the vector of initial hub scores; the largest resulting authority score marks the best authority.)

Hub Scores (Figure: the vector of hub scores H_A, H_B, H_C, H_D equals the adjacency matrix multiplied by the vector of initial authority scores; the largest resulting hub score marks the best hub.)
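
One matrix step of each kind can be written out as follows. The adjacency matrix here is a hypothetical four-page example, since the values from the slides are not available, and the initial scores are assumed to be all 1. The helper names `transpose` and `mat_vec` are illustrative.

```python
PAGES = ["A", "B", "C", "D"]

# Hypothetical adjacency matrix: ADJ[i][j] == 1 means page i links to page j.
ADJ = [
    [0, 1, 1, 0],   # A -> B, C
    [0, 0, 1, 0],   # B -> C
    [1, 0, 0, 0],   # C -> A
    [0, 0, 1, 0],   # D -> C
]

def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]

def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

init_scores = [1, 1, 1, 1]                        # assumed initial scores
auth = mat_vec(transpose(ADJ), init_scores)       # authority = transposed adjacency x hub
hub = mat_vec(ADJ, init_scores)                   # hub = adjacency x authority

print("auth:", dict(zip(PAGES, auth)))
print("hub:", dict(zip(PAGES, hub)))
```

With all-ones starting vectors, the first authority scores are simply inlink counts and the first hub scores are outlink counts; later rounds weight each link by the score of the page at its other end.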

Problems with HITS Has not been widely used – IBM holds patent Query dependence – Later implementations have made it query independent Topic drift – Pages in the expanded base set may not be on the same topic as the root pages – Solution is to examine link text when expanding

Link Spam There is a strong economic incentive to rank highly in a search engine results page (SERP) White hat SEO firms follow published guidelines to improve customer rankings¹ To boost PageRank, black hat SEO practices include: – Building elaborate link farms – Exchanging reciprocal links – Posting links on blogs and forums ¹ Google’s Webmaster Guidelines

Combating Link Spam Sites like Wikipedia can discourage links that only promote PageRank by marking them “nofollow” (for example, a link with the anchor text “Go here!” that carries rel="nofollow") Davison¹ identified 75 features for comparing source and destination pages – Overlap, identical page titles, same links, etc. TrustRank² – Bias teleportation in PageRank to a set of trusted web pages ¹ Davison, Recognizing nepotistic links on the web ² Gyöngyi et al., Combating web spam with TrustRank, VLDB 2004
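
Biasing teleportation toward trusted pages can be sketched as a small change to the PageRank update. This is an illustrative sketch only: the page names, the link graph, and the function name `trust_rank` are hypothetical, and the trusted set is assumed to be hand-picked. The only difference from ordinary PageRank is that the teleportation term goes to trusted pages instead of being spread over all pages.

```python
# Hypothetical link graph: two legitimate pages and a two-page link farm.
LINKS = {
    "news": ["blog"],
    "blog": ["news"],
    "spam1": ["spam2", "blog"],
    "spam2": ["spam1"],
}

def trust_rank(links, trusted, alpha=0.15, iterations=50):
    """PageRank-style scores where teleportation lands only on trusted pages."""
    pages = list(links)
    # Teleportation vector: uniform over the trusted set, zero elsewhere.
    teleport = {page: (1 / len(trusted) if page in trusted else 0.0)
                for page in pages}
    trust = dict(teleport)               # start with all trust on the seeds
    for _ in range(iterations):
        new_trust = {}
        for page in pages:
            inflow = sum(trust[src] / len(outs)
                         for src, outs in links.items() if page in outs)
            new_trust[page] = alpha * teleport[page] + (1 - alpha) * inflow
        trust = new_trust
    return trust

trust = trust_rank(LINKS, trusted={"news"})
for page, score in trust.items():
    print(page, round(score, 4))
```

Pages that cannot be reached by following links from a trusted page receive no trust, no matter how heavily they link to each other.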

Combating Link Spam cont. SpamRank¹ – PageRank for the whole Web has a power-law distribution – Penalize pages whose supporting pages do not approximate a power-law distribution Anti-TrustRank² – Give high weight to known spam pages and propagate values using PageRank – New pages can be classified as spam if a large share of their PageRank comes from known spam pages or if they have high Anti-TrustRank ¹ Benczúr et al., SpamRank – fully automated link spam detection, AIRWeb ² Krishnan & Raj, Web spam detection with Anti-TrustRank, AIRWeb 2006
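
One possible reading of the Anti-TrustRank idea is sketched below. It rests on an assumption the slide does not spell out: distrust is propagated backwards along links, from a known spam page to the pages that link to it, by running the biased PageRank update on the reversed graph. The page names, the graph, and the functions `reverse_links` and `anti_trust_rank` are all hypothetical.

```python
# Hypothetical link graph: "shill" links to "promo", which links to a known spam page.
LINKS = {
    "farm": [],
    "promo": ["farm"],
    "shill": ["promo"],
    "news": ["blog"],
    "blog": ["news"],
}

def reverse_links(links):
    """Flip every edge so scores flow from a page to the pages linking to it."""
    reversed_links = {page: [] for page in links}
    for src, outs in links.items():
        for dst in outs:
            reversed_links[dst].append(src)
    return reversed_links

def anti_trust_rank(links, spam_seeds, alpha=0.15, iterations=50):
    """Biased PageRank on the reversed graph, teleporting only to spam seeds."""
    graph = reverse_links(links)
    pages = list(graph)
    teleport = {page: (1 / len(spam_seeds) if page in spam_seeds else 0.0)
                for page in pages}
    score = dict(teleport)               # start with all weight on known spam
    for _ in range(iterations):
        new_score = {}
        for page in pages:
            inflow = sum(score[src] / len(outs)
                         for src, outs in graph.items() if page in outs)
            new_score[page] = alpha * teleport[page] + (1 - alpha) * inflow
        score = new_score
    return score

anti = anti_trust_rank(LINKS, spam_seeds={"farm"})
for page, value in anti.items():
    print(page, round(value, 4))
```

Under these assumptions, pages that link toward known spam pick up a nonzero score, while pages with no path to a spam seed stay at zero.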