1
CSE 522 – Algorithmic and Economic Aspects of the Internet. Instructors: Nicole Immorlica, Mohammad Mahdian
2
Previously in this class Properties of social networks Probabilistic and game theoretic models for social networks
3
This Lecture Ranking Web Pages using Link Analysis
4
Components of a search engine Crawler: How to handle different types of URLs. How often to crawl each page. How to detect “duplicates”. Indexer: Data structures (to minimize the number of disk accesses). Query handler: Find the set of pages that contain the query word. Sort the results.
5
Sorting the search results HITS (Hypertext Induced Topic Selection): J. Kleinberg, “Authoritative sources in a hyperlinked environment”, SODA 1998. PageRank: S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search engine”, WWW 1998; L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking: bringing order to the web”.
6
Difficulties Too many hits (“abundance”). Number of indexed pages: 110,000 in 1994; 100,000,000 in 1997. ⇒ Often too many pages contain the query. Sometimes pages are not sufficiently self-descriptive. Brin & Page: As of Nov 1997, only one of the top four commercial search engines finds itself! Need to find “popular” pages.
7
Link analysis Instead of using text analysis, we analyze the structure of hyperlinks to extract information about the popularity of a page. Advantages: No need for complicated text analysis Less manipulable, and independent of one person’s point of view. (think of it as a voting system).
8
Relevance vs. popularity Need to achieve a balance between relevance and popularity. Kleinberg’s approach: construct a focused subgraph based on relevance, and return the most popular page in this subgraph. Google’s approach: compute a measure of relevance (considering how many times and in what form [title/URL/font size/anchor] the query appears in the page), and multiply it by a popularity measure called PageRank.
9
Constructing a focused subgraph Desired properties: Relatively small Rich in relevant pages Contains most of the strongest authorities on the subject.
10
Constructing a focused subgraph Given a query, start with the set R of the top ~200 text-based hits for the query. Add to this set: the set of pages that have a link from a page in R; the set of pages that have a link to a page p in R, with an upper limit of ~50 pages per p ∈ R. Call the resulting set S. Find the most “authoritative” page in G[S].
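The expansion from R to S can be sketched in Python as follows. This is only a minimal illustration, assuming the link graph is already available as a dictionary; the names `build_base_set`, `induced_subgraph`, `out_links`, and `max_in_per_page` are hypothetical, and which in-linking pages are kept when a page has more than the limit is an arbitrary choice here (the first ones encountered).

```python
from collections import defaultdict


def build_base_set(root, out_links, max_in_per_page=50):
    """Expand a root set R into a base set S.

    root: iterable of page ids (e.g. the top ~200 text-based hits).
    out_links: dict mapping a page id to the pages it links to (assumed input).
    max_in_per_page: cap on in-linking pages added per root page.
    """
    # Invert the link graph once so in-links can be looked up per page.
    in_links = defaultdict(list)
    for src, dsts in out_links.items():
        for dst in dsts:
            in_links[dst].append(src)

    base = set(root)
    for p in root:
        # Every page that p links to.
        base.update(out_links.get(p, ()))
        # At most max_in_per_page pages that link to p (arbitrary choice: first seen).
        base.update(in_links.get(p, [])[:max_in_per_page])
    return base


def induced_subgraph(nodes, out_links):
    """Return G[S]: keep only links whose two endpoints are both in nodes."""
    return {u: [v for v in out_links.get(u, ()) if v in nodes] for u in nodes}
```

For example, with `links = {"a": ["r1"], "r1": ["x"], "x": []}`, calling `build_base_set(["r1"], links)` returns the root page together with `"a"` and `"x"`.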
11
Finding authorities Approach 1: vertices with the largest in-degrees. This approach is used to evaluate scientific citations (the “impact factor”). Deficiencies: A page might have a large in-degree from low-quality pages. “Universally popular” pages often dominate the result. Easy to manipulate.
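A small sketch of in-degree ranking makes the manipulation problem concrete. The function name `rank_by_in_degree` and the toy page names are made up for illustration; ties are broken alphabetically, which is an arbitrary choice.

```python
from collections import Counter


def rank_by_in_degree(out_links):
    """Rank pages by the number of distinct other pages linking to them."""
    in_degree = Counter()
    for src, dsts in out_links.items():
        in_degree[src] += 0  # make sure pages without in-links still appear
        for dst in set(dsts):  # count each linking page at most once
            if dst != src:  # ignore self-links
                in_degree[dst] += 1
    # Highest in-degree first; ties broken by page id.
    return sorted(in_degree, key=lambda p: (-in_degree[p], p))


# Two throwaway pages are enough to push "target" above a page
# that is linked from a single good hub.
toy_links = {
    "spam1": ["target"],
    "spam2": ["target"],
    "good_hub": ["authority"],
    "authority": [],
}
ranking = rank_by_in_degree(toy_links)
```

In this toy graph `"target"` ranks first purely because of the two low-quality links, since in-degree gives every linking page the same weight.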
12
Finding authorities Approach 2: define the set of authorities recursively. Best authorities on a subject have a large in-degree from the best hubs on the subject. Best hubs on a subject give links to the best authorities on the subject. Formulation as a principal eigenvector.
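The recursive definition can be turned into an iteration: each authority score becomes the sum of the hub scores of the pages linking to it, each hub score becomes the sum of the authority scores of the pages it links to, and both vectors are rescaled after every round. In matrix terms, with A the adjacency matrix of G[S], this is power iteration, so the authority vector tends to the principal eigenvector of AᵀA and the hub vector to that of AAᵀ. The sketch below is an illustrative implementation under those assumptions; the names `hits`, `normalize`, and the fixed iteration count are choices made here, not part of the slides.

```python
import math


def normalize(scores):
    """Scale a score dict to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(x * x for x in scores.values()))
    if norm == 0:
        return dict(scores)
    return {k: x / norm for k, x in scores.items()}


def hits(out_links, iterations=50):
    """Return (hub, authority) score dicts for a graph given as page -> linked pages."""
    nodes = set(out_links)
    for dsts in out_links.values():
        nodes.update(dsts)

    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        # Authority update: add up the hub scores of the pages linking to v.
        new_auth = {v: 0.0 for v in nodes}
        for u, dsts in out_links.items():
            for v in dsts:
                new_auth[v] += hub[u]
        # Hub update: add up the (new) authority scores of the pages u links to.
        new_hub = {u: sum(new_auth[v] for v in out_links.get(u, ())) for u in nodes}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth
```

On the toy graph `{"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}`, `"a1"` ends up as the top authority and `"h1"` as the top hub.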
13
Discussion This algorithm can also be used to find the closest pages to a given page p. Let R be the set of at most ~200 pages that point to p. Can also compute multiple sets of hubs and authorities.
14
PageRank Again, the idea is a recursive definition of importance: An important page is a page that has many links from other important pages. Problems: Not always well-defined. Pages with no out-degree form rank sinks.
15
PageRank Fix: consider a “random surfer” who at every step either clicks on a random link or, with some fixed probability, gets bored and starts again from a random page. PageRank takes this probability to be ≈ 1/7, and uses a non-uniform distribution for starting again.
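The random-surfer model can be simulated by repeatedly pushing probability mass along links. The following is a minimal sketch, not Google's implementation: the names `pagerank`, `restart_prob`, and `restart_dist` are hypothetical, the restart distribution defaults to uniform for simplicity, and pages with no out-links are handled by assuming the surfer always restarts from them.

```python
def pagerank(out_links, restart_prob=1 / 7, restart_dist=None, iterations=100):
    """Approximate the stationary distribution of the random surfer.

    out_links: dict mapping a page id to the pages it links to.
    restart_prob: probability of getting bored and restarting at each step.
    restart_dist: dict page -> restart probability (should sum to 1);
                  defaults to uniform, which is a simplification.
    """
    nodes = set(out_links)
    for dsts in out_links.values():
        nodes.update(dsts)
    if restart_dist is None:
        restart_dist = {v: 1.0 / len(nodes) for v in nodes}

    rank = {v: restart_dist.get(v, 0.0) for v in nodes}
    for _ in range(iterations):
        new_rank = {v: 0.0 for v in nodes}
        restart_mass = 0.0
        for u in nodes:
            dsts = out_links.get(u, ())
            if dsts:
                # Follow a uniformly random out-link with probability 1 - restart_prob.
                share = (1 - restart_prob) * rank[u] / len(dsts)
                for v in dsts:
                    new_rank[v] += share
                restart_mass += restart_prob * rank[u]
            else:
                # Assumption: from a page with no out-links the surfer always restarts.
                restart_mass += rank[u]
        # Surfers who restarted land according to the restart distribution.
        for v in nodes:
            new_rank[v] += restart_mass * restart_dist.get(v, 0.0)
        rank = new_rank
    return rank
```

Passing a non-uniform `restart_dist` biases the ranking toward the pages favored by that distribution, which is how the restart step can be personalized.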