CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.

1 CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian

2 Previously in this class
 - Properties of social networks
 - Probabilistic and game-theoretic models for social networks

3 This Lecture Ranking Web Pages using Link Analysis

4 Components of a search engine
Crawler
 - How to handle different types of URLs
 - How often to crawl each page
 - How to detect “duplicates”
Indexer
 - Data structures (to minimize the number of disk accesses)
Query handler
 - Find the set of pages that contain the query word.
 - Sort the results.

5 Sorting the search results
HITS (Hypertext Induced Topic Selection)
 - J. Kleinberg, “Authoritative sources in a hyperlinked environment”, SODA 1998.
PageRank
 - S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search engine”, WWW 1998.
 - L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking: bringing order to the web”.

6 Difficulties
Too many hits (“abundance”)
 - Number of indexed pages: 110,000 in 1994; 100,000,000 in 1997.
 - ⇒ Often too many pages contain the query.
Sometimes pages are not sufficiently self-descriptive.
 - Brin & Page: as of Nov. 1997, only one of the top four commercial search engines finds itself!
Need to find “popular” pages.

7 Link analysis
Instead of using text analysis, we analyze the structure of hyperlinks to extract information about the popularity of a page.
Advantages:
 - No need for complicated text analysis.
 - Less manipulable, and independent of any one person’s point of view (think of it as a voting system).

8 Relevance vs. popularity
Need to achieve a balance between relevance and popularity.
 - Kleinberg’s approach: construct a focused subgraph based on relevance, and return the most popular page in this subgraph.
 - Google’s approach: compute a measure of relevance (considering how many times and in what form [title/URL/font size/anchor] the query appears in the page), and multiply it by a popularity measure called PageRank.

9 Constructing a focused subgraph
Desired properties:
 - Relatively small
 - Rich in relevant pages
 - Contains most of the strongest authorities on the subject.

10 Constructing a focused subgraph
Given a query σ, start with the set R_σ of the top ~200 text-based hits for σ.
Add to this set:
 - the set of pages that have a link from a page in R_σ;
 - the set of pages that have a link to a page p in R_σ, with an upper limit of ~50 pages per p ∈ R_σ.
Call the resulting set S_σ.
Find the most “authoritative” page in G[S_σ].
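The construction above can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the function names, the dict-based link index, and the choice to keep the first 50 in-linking pages (rather than an arbitrary 50) are all assumptions made for the example.

```python
def build_base_set(root, out_links, in_links, max_in_per_page=50):
    """Expand a root set into a base set, following the slide's recipe.

    root: list of page ids (the top text-based hits for the query).
    out_links / in_links: dicts mapping a page to the pages it links to /
    is linked from. Both are hypothetical stand-ins for a real link index.
    """
    base = set(root)
    for p in root:
        # Add every page that p links to.
        base.update(out_links.get(p, []))
        # Add pages that link to p, capped per page so that one very
        # popular page cannot blow up the subgraph. Assumption: keep the
        # first max_in_per_page entries.
        base.update(list(in_links.get(p, []))[:max_in_per_page])
    return base


def induced_subgraph(nodes, out_links):
    """Return the edges of G[S]: links with both endpoints in `nodes`."""
    return {(u, v) for u in nodes for v in out_links.get(u, []) if v in nodes}


# Toy link index (made up for illustration).
out_links = {"a": ["b", "c"], "d": ["a"], "e": ["a"], "c": ["z"]}
in_links = {"a": ["d", "e"], "b": ["a"], "c": ["a"], "z": ["c"]}

base = build_base_set(["a"], out_links, in_links, max_in_per_page=1)
edges = induced_subgraph(base, out_links)
```

With the cap set to 1, only one of the two pages linking to "a" enters the base set, and the link from "c" to "z" is dropped because "z" lies outside the subgraph.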

11 Finding authorities
Approach 1: vertices with the largest in-degrees
This approach is used to evaluate scientific citations (the “impact factor”).
Deficiencies:
 - A page might have a large in-degree from low-quality pages.
 - “Universally popular” pages often dominate the result.
 - Easy to manipulate.
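A small sketch of Approach 1 makes the manipulation problem concrete. The helper name and the toy edge list are assumptions for illustration; pages with no incoming links are simply omitted from the ranking.

```python
from collections import Counter


def rank_by_in_degree(edges):
    """Rank pages by number of incoming links (ties broken by name)."""
    counts = Counter(target for _, target in edges)
    return sorted(counts, key=lambda page: (-counts[page], page))


# Two hub pages link to "good"; three throwaway pages link to "junk".
toy_edges = [
    ("hub1", "good"), ("hub2", "good"),
    ("s1", "junk"), ("s2", "junk"), ("s3", "junk"),
]
in_degree_ranking = rank_by_in_degree(toy_edges)
```

Here "junk" outranks "good" only because someone created three pages pointing at it, which is the kind of manipulation the slide warns about.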

12 Finding authorities
Approach 2: define the set of authorities recursively.
 - Best authorities on a subject have a large in-degree from the best hubs on the subject.
 - Best hubs on a subject give links to the best authorities on the subject.
Formulation as a principal eigenvector
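The recursive definition can be turned into an iteration. The sketch below assumes the usual update from Kleinberg's paper (a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to), with both vectors rescaled to unit length each round. Under that assumption the authority vector approaches the principal eigenvector of AᵀA, where A is the adjacency matrix of the subgraph. Function names and the toy graph are illustrative only.

```python
import math


def normalize(scores):
    """Scale a score dict to unit Euclidean length (no-op if all zero)."""
    norm = math.sqrt(sum(x * x for x in scores.values()))
    return {n: x / norm for n, x in scores.items()} if norm else scores


def hits(edges, iterations=50):
    """Return (authority, hub) score dicts for a list of (source, target) links."""
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum hub scores over incoming links.
        new_auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_auth[v] += hub[u]
        # Hub update: sum the fresh authority scores over outgoing links.
        new_hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_hub[u] += new_auth[v]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


# Toy subgraph: h1 links to both candidate authorities, h2 and h3 to a1 only.
hits_edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
auth, hub = hits(hits_edges)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph a1 comes out as the top authority and h1 as the top hub, since h1 is the only page that links to every authority.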

13 Discussion
This algorithm can also be used to find the closest pages to a given page p.
 - Let the root set R be the set of at most ~200 pages that point to p.
Can also compute multiple sets of hubs and authorities.

14 PageRank
Again, the idea is a recursive definition of importance:
 - An important page is a page that has many links from other important pages.
Problems:
 - Not always well-defined.
 - Pages with no out-degree form rank sinks.
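To see why pages without out-links are a problem, here is a sketch of the naive recursion with no correction. It assumes each page splits its current rank evenly among the pages it links to; the function name and the three-page chain are made up for the example.

```python
def naive_rank_step(rank, out_links):
    """One round of the uncorrected recursion: each page passes its rank
    on in equal shares to the pages it links to."""
    new_rank = {page: 0.0 for page in rank}
    for page, score in rank.items():
        targets = out_links.get(page, [])
        for t in targets:
            new_rank[t] += score / len(targets)
        # A page with no out-links passes nothing on, so its rank is lost.
    return new_rank


# A chain a -> b -> c, where c has no outgoing links.
chain_links = {"a": ["b"], "b": ["c"], "c": []}
chain_rank = {page: 1 / 3 for page in chain_links}
totals = []
for _ in range(3):
    chain_rank = naive_rank_step(chain_rank, chain_links)
    totals.append(sum(chain_rank.values()))
```

The total rank in the system shrinks every round and reaches zero after three rounds on this chain, so the recursion alone does not give a meaningful ranking here.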

15 PageRank
Fix: consider a “random surfer”, who at every step either clicks on a random link or, with some small probability, gets bored and starts again from a random page.
PageRank takes this probability to be ≈ 1/7, and uses a non-uniform distribution for starting again.
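The random-surfer fix can be sketched as a simple iteration. The parameter names (`restart_prob`, `restart_dist`) are hypothetical, the default restart distribution is uniform for simplicity even though the slide notes PageRank uses a non-uniform one, and the rule that a surfer on a page with no out-links always restarts is an assumption of this sketch.

```python
def pagerank(out_links, restart_prob=1 / 7, restart_dist=None, iterations=100):
    """Random-surfer rank by repeated propagation.

    out_links: dict mapping a page to the list of pages it links to.
    restart_dist: dict of restart probabilities summing to 1
                  (assumption: uniform when not given).
    """
    pages = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    if restart_dist is None:
        restart_dist = {p: 1.0 / len(pages) for p in pages}
    rank = {p: restart_dist.get(p, 0.0) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        restart_mass = 0.0
        for p in pages:
            targets = out_links.get(p, [])
            if targets:
                # Follow a random link with probability 1 - restart_prob.
                share = (1 - restart_prob) * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
                restart_mass += restart_prob * rank[p]
            else:
                # Assumption: with nowhere to click, the surfer restarts.
                restart_mass += rank[p]
        # Surfers who restart land according to the restart distribution.
        for q in pages:
            new_rank[q] += restart_mass * restart_dist.get(q, 0.0)
        rank = new_rank
    return rank


# The same chain a -> b -> c that drains all rank without restarts.
surfer_rank = pagerank({"a": ["b"], "b": ["c"], "c": []})
```

On the chain a → b → c the scores now sum to 1 and increase along the chain. Passing a non-uniform `restart_dist` biases the ranking toward the pages the surfer restarts from.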

