Using Hyperlink structure information for web search.

1 Using Hyperlink structure information for web search

2 Hyperlink structure information "Hyperlink analysis for the web" by Monika R. Henzinger, Google Inc.; "Structural web search using a graph-based discovery system" by Nitish Manocha et al., University of Texas

3 How are hyperlinks useful? Assumptions: a) Assumption 1. A hyperlink from page A to page B is a recommendation of page B by the author of page A. b) Assumption 2. If page A and page B are connected by a hyperlink, then they might be on the same topic. c) Pages pointed to by many pages are of higher quality than pages pointed to by fewer pages.

4 Main uses of hyperlink analysis: crawling (collecting the pages); ranking (ordering the returned results); computing the geographic scope of a web page; finding mirrored hosts; computing statistics about web pages and search engines. Major search engines use hyperlink analysis but do not disclose their algorithms.

5 Crawling Collect web pages: start with a set of pages and recursively follow their hyperlinks.
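The crawling step can be sketched on a toy in-memory web. This is a minimal illustration, not how any production crawler works: the function name `crawl`, the `get_links` callback, the `max_pages` limit, and the toy data are all assumptions, and a queue is used in place of literal recursion.

```python
from collections import deque

def crawl(seeds, get_links, max_pages=100):
    """Breadth-first crawl: visit the seed pages, then the pages they link to."""
    visited = []              # pages in the order they were collected
    seen = set(seeds)         # pages already queued, to avoid revisiting
    queue = deque(seeds)
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for target in get_links(page):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited

# Hypothetical toy web: page -> pages it links to.
toy_web = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl(["a"], lambda page: toy_web.get(page, []))
print(order)
```

In a real crawler `get_links` would fetch the page and extract its hyperlinks; here it is a dictionary lookup so the traversal logic stays visible.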

6 Traditional IR Vector model or Boolean model. Does not work well on the web because web page authors can manipulate the ranking. The power of hyperlink analysis comes from the fact that it uses the content of other pages to rank the current page.

7 Connectivity-Based Ranking (ranking using hyperlink analysis) Query-independent schemes assign a score to a page independent of a given query; query-dependent schemes assign a score to a page in the context of a given query.

8 Model Web pages as a graph: each page is a node, each hyperlink is an edge. Directed graph: link graph, used for finding related pages. Undirected graph: co-citation graph, used for categorizing related pages.
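The two graph views can be sketched as follows, assuming the usual definition of co-citation (two pages are connected when some page links to both of them). The dictionary representation, the function name `cocitation_graph`, and the toy data are illustrative assumptions.

```python
from itertools import combinations

def cocitation_graph(links):
    """Build an undirected co-citation graph from a directed link graph.

    links: dict mapping each page to the set of pages it links to.
    Returns a dict mapping frozenset({u, v}) to the number of pages
    that link to both u and v.
    """
    counts = {}
    for targets in links.values():
        for u, v in combinations(sorted(targets), 2):
            edge = frozenset((u, v))
            counts[edge] = counts.get(edge, 0) + 1
    return counts

# Hypothetical directed link graph.
links = {"a": {"c", "d"}, "b": {"c", "d", "e"}, "c": set()}
cocited = cocitation_graph(links)
print(cocited[frozenset(("c", "d"))])  # number of pages linking to both c and d
```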

9

10

11 Query-independent Ranking Major drawback of simply counting incoming links: it does not distinguish between a page pointed to by a number of low-quality pages and a page pointed to by the same number of high-quality pages. PageRank algorithm: weight each hyperlink to a page proportionally to the quality of the page containing the hyperlink, so the PageRank of a page A depends on the PageRank of each page B pointing to A. Used by Google.
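A minimal sketch of the PageRank idea by power iteration is given below. The damping value 0.85, the fixed iteration count, the even redistribution of rank from pages without out-links, and the toy graph are assumptions made for illustration; the slides do not state these details, and this is not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly (assumption).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for source, targets in links.items():
            for target in targets:
                # Each link passes on an equal share of the source's rank.
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

# Hypothetical graph: x and y both have exactly one incoming link, but
# x is linked from "hub", which itself has three incoming links.
links = {"p1": ["hub"], "p2": ["hub"], "p3": ["hub"],
         "hub": ["x"], "lonely": ["y"]}
rank = pagerank(links)
print(rank["x"] > rank["y"])
```

A plain in-link count would score `x` and `y` equally; weighting each link by the rank of its source is what separates them.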

12 Query-dependent Ranking Build a query-specific graph, the neighborhood graph: a) Start with the set of documents matching the query (the start set). b) Augment it with the documents that either hyperlink to or are hyperlinked to by the documents in the start set. Then perform the hyperlink analysis on this graph.
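The construction of the neighborhood graph can be sketched as below, assuming the whole link graph fits in a dictionary. The function name `neighborhood_graph` and the toy data are hypothetical, and the start set is taken as given rather than computed from a query.

```python
def neighborhood_graph(links, start_set):
    """Return the subgraph on the start set plus its in- and out-neighbors.

    links: dict mapping each page to the set of pages it links to.
    """
    start = set(start_set)
    nodes = set(start)
    for source, targets in links.items():
        if source in start:
            nodes.update(targets)      # pages the start set links to
        if start & set(targets):
            nodes.add(source)          # pages that link into the start set
    # Keep only the edges whose endpoints are both in the neighborhood.
    return {n: {t for t in links.get(n, ()) if t in nodes} for n in nodes}

# Hypothetical link graph; "s" is the only page matching the query.
links = {"a": {"s"}, "s": {"b"}, "b": {"c"}, "z": {"c"}}
sub = neighborhood_graph(links, {"s"})
print(sorted(sub))
```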

13

14 Query-dependent Ranking (continued) a) Indegree-based approach (the number of documents hyperlinking to a document in the start set). b) Authorities (pages with good content on a topic) and hubs (directory-like pages with many hyperlinks to pages on the topic). c) HITS algorithm to determine good hubs and good authorities. Each node has an authority score and a hub score.
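The hub and authority scores can be sketched with the standard iterative update: a node's authority score is the sum of the hub scores of the pages pointing to it, and its hub score is the sum of the authority scores of the pages it points to. The fixed iteration count, the Euclidean normalization, and the toy graph are assumptions for illustration.

```python
import math

def hits(graph, iterations=50):
    """graph: dict mapping each page to the set of pages it links to."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking to the node.
        auth = {n: 0.0 for n in nodes}
        for source, targets in graph.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub: sum of authority scores of the pages the node links to.
        hub = {n: sum(auth[t] for t in graph.get(n, ())) for n in nodes}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Hypothetical neighborhood graph: three directory-like pages, two content pages.
graph = {"h1": {"a1", "a2"}, "h2": {"a1", "a2"}, "h3": {"a1"}}
hub, auth = hits(graph)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```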

15 Problems of HITS Small additions to the neighborhood graph may considerably change the hub and authority scores. Topic drift occurs when the majority of pages in the neighborhood graph are on a topic different from the query topic.

16 Structural web search using a graph-based discovery system WebSUBDUE: SUBDUE is the engine for knowledge discovery (data mining). Supports structural search, text search, synonym search, and combinations of these searches. Data preparation: a crawler written in Perl builds the labeled graph for the web site. The labeled graph is fed into the SUBDUE system. A query can be modeled as a labeled graph as well. Search for the query subgraph in the site graph. Comparison with an existing search engine: AltaVista.

17

18 Find all pages that link to a page containing the term subdue
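A structural query of this kind can be sketched as a two-node pattern (a page, a hyperlink edge, and a target page labeled with the term) matched against a labeled site graph. This is only a plain-Python illustration of the idea, not WebSUBDUE's subgraph matching; the function name, the graph representation, and the toy site are hypothetical.

```python
def pages_linking_to_term(links, words, term):
    """Find pages with a hyperlink to some page whose word labels include term.

    links: dict mapping each page to the set of pages it links to.
    words: dict mapping each page to the set of words it contains.
    """
    term = term.lower()
    # Pages whose word labels contain the term (case-insensitive).
    matching = {page for page, ws in words.items()
                if term in {w.lower() for w in ws}}
    # Pages with at least one hyperlink edge into a matching page.
    return sorted(source for source, targets in links.items()
                  if matching & set(targets))

# Hypothetical labeled site graph.
links = {"home": {"research", "people"}, "research": {"subdue_page"},
         "people": set(), "subdue_page": set()}
words = {"home": {"welcome"}, "research": {"projects"},
         "people": {"faculty"}, "subdue_page": {"Subdue", "graph"}}
result = pages_linking_to_term(links, words, "subdue")
print(result)
```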

19 Jobs in computer science

20 Find hub and authority pages on “algorithm”

21 Conclusion Hyperlink structure is valuable information. Hyperlink information can enhance conventional web search in crawling, ranking, and related tasks. Hyperlink information can also support structural search, which is still missing in existing search engines.

