Download presentation
Presentation is loading. Please wait.
1
Web Search – Summer Term 2006 VI. Web Search - Ranking (c) Wolfgang Hürst, Albert-Ludwigs-University
2
Personalized and Topic-Sensitive PageRank Uniform distribution (1-d) to model jumps to random web pages is not realistic (bookmarks, known URLs,...) Idea : Use jump probability for personalization, i.e. - Instead of identity matrix: Weighting based on personal preferences (e.g. bookmarks) - Problem: Pre-calculation impossible (personal data!) and online calculation to expensive Another characteristic of PageRank: query independent - Might be critical if page with high PageRank accidentally gets selected as relevant - Idea: Create a topic-sensitive PageRank
3
Topic-Sensitive PageRank Basic idea : 1. Identify topic that might be interesting for the user (e.g. via classification of the query, eval. of context,...) 2. Use pre-calculated, topic-sensitive PageRank Similarities to personalization but - Fixed, pre-specified topics (can be pre-calculated!) - Depending on the actual situation (more flexible) Topic specific PageRank rank jd : Normally: Identity matrix (a ij = 1 or 1/N) Now: Topics c 1,..., c n, e.g. the 16 top-level categories from the Open directory project Topic dependent weighting (1/|T i |) Advantage: Can be calculated in advance SOURCE: [5]
4
Topic-Sensitive PageRank (cont.) Question : Which one to select during run time? Idea : Automatic classification of the topic based on the query q given by the user Extension: Consider context q' of query q, e.g. - surrounding text if query was entered via highlighting - based on the history (if available) etc. Calculation (e.g.) using a unigram language model: SOURCE: [5]
5
Topic-Sensitive PageRank (cont.) Alternative approach : Use probabilities, i.e. - Weighted summation of all topic specific PageRanks for one document - Weights: Depending on the probability of a particular topic being relevant given the query q - Definition: Query-Sensitive Importance Score s qd In practice: Usually just take the three topic-sensitive PageRanks with highest probability Disadvantages: - Fixed set of topics - Depends on training set SOURCE: [5]
6
References - Indexing [1] A. ARASU, J. CHO, H. GARCIA-MOLINA, A. PAEPCKE, S. RAGHAVAN: "SEARCHING THE WEB", ACM TRANSACTIONS ON INTERNET TECHNOLOGY, VOL 1/1, AUG. 2001 Chapter 5 (Ranking and Link Analysis) [2] S. BRIN, L. PAGE: "THE ANATOMY OF A LARGE-SCALE HYPERTEXTUAL WEB SEARCH ENGINE", WWW 1998 Chapter 2 and 4.5.1 [3] BORDER, KUMAR, MAGHOUL, RAGHAVAN, RAJAGOPALAN, STATA, TOMKINS, WIENER: "GRAPH STRUCTURE IN THE WEB", WWW 2000 [4] PAGE, BRIN, MOTWANI, WINOGRAD: "THE PAGERANK CITATION RANKING: BRINGING ORDER TO THE WEB", STANFORD TECHNICAL REPORT [5] HAVELIWALA: "TOPIC-SENSITIVE PAGERANK", WWW 2002
7
General Web Search Engine Architecture CLIENT QUERY ENGINE RANKING CRAWL CONTROL CRAWLER(S) USAGE FEEDBACK RESULTS QUERIES WWW COLLECTION ANALYSIS MOD. INDEXER MODULE PAGE REPOSITORY INDEXES STRUCTUREUTILITYTEXT (CF. [1] FIG. 1)
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.