Web Search – Summer Term 2006 VII. Selected Topics - The Hilltop Algorithm (c) Wolfgang Hürst, Albert-Ludwigs-University.


1 Web Search – Summer Term 2006 VII. Selected Topics - The Hilltop Algorithm (c) Wolfgang Hürst, Albert-Ludwigs-University

2 Recap (PageRank and HITS)
PageRank and HITS: Both search the web based on
a) Relevance (content, anchor text, ...)
b) Quality, importance, authority, ...
The latter is based on link structure.
- PageRank: Global, query-independent, recursive calculation over all pages
- HITS: Local subgraph containing relevant documents, distinguishes between hubs and authorities

3 Hilltop [1]: Basic Idea
Observation: Many web users (authors)
- Create web pages with link lists about topics they are very familiar with (experts)
- Maintain these pages well / try to keep them up to date
- Link to good, high-quality pages
Idea: Try to find such pages automatically and use their link structure for ranking
Compare HITS: Similar to hubs, but with a more explicit description of "expert" sources and a global view (i.e. query-independent)

4 Hilltop [1]: Basic Idea (cont.)
An expert page is a page that is about a certain topic and has links to many non-affiliated pages on that topic
- "non-affiliated" means authors from non-affiliated organizations (modeled, e.g., by URL processing)
- "links to many ... pages" can be modeled (e.g.) by a threshold
A page is an authority on a query topic if, and only if, some of the best experts on that query topic point to it

5 Hilltop [1]: Basic Idea (cont.)
General approach:
1. Identify experts (in advance, i.e. query-independent)
2. Select experts for a particular topic (depending on a specific query)
3. Use these experts to find and rank authorities for this topic

6 Identifying good expert pages
What makes a good expert, and how can experts be found?
A good expert is objective, diverse, unbiased, and points to numerous non-affiliated pages.
Two hosts can be defined as affiliated if
- they share the same first 3 octets of the IP address, OR
- the rightmost non-generic token in the hostname is the same (tokens = substrings of a hostname delimited by ".")
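The affiliation test on this slide could be sketched roughly as follows. This is only an illustration: the GENERIC_TOKENS set is a hypothetical, abbreviated list (the slides do not say which tokens count as generic), and the function names are made up for this example.

```python
# Hypothetical, abbreviated list of generic hostname tokens; a real system
# would need a far more complete list of suffixes and common prefixes.
GENERIC_TOKENS = {"com", "org", "net", "edu", "gov", "co", "ac", "uk", "de", "www"}


def rightmost_non_generic_token(hostname: str) -> str:
    """Return the rightmost "."-delimited token that is not generic ("" if none)."""
    for token in reversed(hostname.lower().split(".")):
        if token and token not in GENERIC_TOKENS:
            return token
    return ""


def same_first_three_octets(ip_a: str, ip_b: str) -> bool:
    """True if two dotted IPv4 addresses share their first three octets."""
    return ip_a.split(".")[:3] == ip_b.split(".")[:3]


def affiliated(host_a: str, ip_a: str, host_b: str, ip_b: str) -> bool:
    """Two hosts are affiliated if either rule from the slide applies."""
    token_a = rightmost_non_generic_token(host_a)
    token_b = rightmost_non_generic_token(host_b)
    same_token = token_a != "" and token_a == token_b
    return same_first_three_octets(ip_a, ip_b) or same_token
```

With this token list, "www.ibm.com" and "ibm.co.uk" would count as affiliated because both reduce to the token "ibm", regardless of their IP addresses.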

7 Identifying good expert pages
1st: Divide all (indexed) web pages into groups of affiliated ones
2nd: Get experts (i.e. pages pointing to lots of non-affiliated pages) based on their number of links to different groups (e.g. using a threshold)
Note: This is all topic-independent!
Possible extensions:
- Consider topic-related clusters (if available)
- Consider special characteristics of a page (e.g. similar formatting, etc.)
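One possible way to sketch these two steps is a union-find grouping of hosts followed by a threshold on the number of distinct groups a page links to. The data layout (out_links, host_of), the helper names, and the default threshold min_groups=5 are assumptions made for this example; the slides only say "e.g. using a threshold".

```python
def group_hosts(hosts, affiliated_pairs):
    """Union-find: map each host to a representative of its affiliation group."""
    parent = {h: h for h in hosts}

    def find(h):
        while parent[h] != h:
            parent[h] = parent[parent[h]]  # path halving keeps trees shallow
            h = parent[h]
        return h

    for a, b in affiliated_pairs:
        parent[find(a)] = find(b)  # merge the two groups
    return {h: find(h) for h in hosts}


def find_experts(out_links, host_of, group_of, min_groups=5):
    """Return pages linking to at least min_groups groups other than their own.

    out_links: page -> list of target pages
    host_of:   page -> hostname
    group_of:  hostname -> group representative (from group_hosts)
    min_groups is a hypothetical threshold, not a value from the slides.
    """
    experts = []
    for page, targets in out_links.items():
        own_group = group_of[host_of[page]]
        groups = {group_of[host_of[t]] for t in targets} - {own_group}
        if len(groups) >= min_groups:
            experts.append(page)
    return experts
```

Both functions work without any query, which matches the note that this stage is topic-independent.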

8 Indexing experts
Identification of experts: done in advance, i.e. topic / query-independent
Selection of experts for a particular topic: done during the search process, i.e. query-dependent
Therefore: create an inverted file for all pages that have been identified as experts
Only index so-called key phrases, i.e.
- Take all words in the title, in headlines (<h1>, <h2>, ... tags), and in the anchor text of a URL
- Associate these phrases with the respective URLs
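A minimal sketch of such an inverted file might look like the following. The input layout (a dict from expert URL to a list of (level, phrase text, associated URLs) tuples) and the function name are assumptions for illustration; extracting the key phrases from HTML is not shown.

```python
from collections import defaultdict


def build_expert_index(expert_pages):
    """Build an inverted index: term -> list of postings.

    expert_pages: expert URL -> list of (level, phrase_text, urls), where
    level is "title", "heading" or "anchor" and urls are the URLs the
    phrase is associated with. Each posting is a tuple
    (expert_url, level, phrase_text, urls).
    """
    index = defaultdict(list)
    for expert_url, key_phrases in expert_pages.items():
        for level, text, urls in key_phrases:
            posting = (expert_url, level, text, tuple(urls))
            # Index each distinct term of the phrase once.
            for term in set(text.lower().split()):
                index[term].append(posting)
    return index
```

At query time, looking up each query term in the index yields the experts and the key phrases (with their associated URLs) that mention it.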

9 Search: Get and rank authorities
With this, we have:
- Experts for different topics
- All information we need to select all experts for a particular topic given the query terms q_i
Query processing is now done in two steps:
1. Select & rate experts (based on query)
2. Select & rate authorities (based on experts)

10 1. Select & rate experts
Select a page as an expert (e.g.) if all query terms q_i are associated with at least one URL
Rate the selected experts by calculating an expert score for each expert
For this, we define for each key phrase p and query q:
- LevelScore(p) = weighting of the type of key phrase (e.g. title: 16, heading: 6, anchor: 1)
- FullnessFactor(p, q) = measure of the number of terms in p that are covered by query terms:
  IF m ≤ 2 THEN FullnessFactor(p, q) = 1
  ELSE FullnessFactor(p, q) = 1 - (m - 2) / plen
  (following [1]: plen = number of terms in p, m = number of terms in p that are not query terms)
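The two per-phrase quantities can be sketched in a few lines. The weights are the example values from the slide; the reading of m (phrase terms that are not query terms) and plen (phrase length) follows [1], and the function and variable names are chosen only for this illustration.

```python
# Example weights from the slide: title 16, heading 6, anchor 1.
LEVEL_SCORE = {"title": 16, "heading": 6, "anchor": 1}


def fullness_factor(phrase_terms, query_terms):
    """FullnessFactor(p, q) for a key phrase p (list of terms) and query q (set).

    m    = number of phrase terms that are not query terms (surplus terms)
    plen = number of terms in the phrase
    """
    plen = len(phrase_terms)
    m = sum(1 for term in phrase_terms if term not in query_terms)
    if m <= 2:
        return 1.0
    return 1.0 - (m - 2) / plen
```

For the query {"jazz", "guitar"}, the phrase "best free online jazz guitar" has m = 3 and plen = 5, so its factor would be 1 - 1/5 = 0.8, while "jazz guitar links" (m = 1) keeps the full factor of 1.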

11 1. Select & rate experts (cont.)
Based on the LevelScore and FullnessFactor, measures S_i are calculated as follows:
S_i = Σ LevelScore(p) × FullnessFactor(p, q)
(with Σ being the sum over all key phrases p with k - i query terms, where k is the number of query terms)
The expert score is finally calculated as
Expert_Score = 2^32 × S_0 + 2^16 × S_1 + S_2
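As a rough sketch, the expert score could be computed as below. The representation of key phrases as (level, list of terms) pairs is an assumption, and so is the choice to skip phrases that contain no query term at all; the helper definitions are repeated so the block stands on its own.

```python
LEVEL_SCORE = {"title": 16, "heading": 6, "anchor": 1}


def fullness_factor(phrase_terms, query_terms):
    """1 if at most two surplus terms, else 1 - (m - 2) / plen."""
    m = sum(1 for term in phrase_terms if term not in query_terms)
    return 1.0 if m <= 2 else 1.0 - (m - 2) / len(phrase_terms)


def expert_score(key_phrases, query_terms):
    """Expert_Score = 2^32 * S_0 + 2^16 * S_1 + S_2 for one expert page.

    key_phrases: list of (level, list_of_terms) found on the expert page
    query_terms: set of query terms (k = len(query_terms))
    """
    k = len(query_terms)
    s = [0.0, 0.0, 0.0]  # S_0, S_1, S_2
    for level, terms in key_phrases:
        matched = len(query_terms & set(terms))
        i = k - matched  # phrase contains k - i query terms
        # Assumption: phrases without any query term do not contribute.
        if matched > 0 and i <= 2:
            s[i] += LEVEL_SCORE[level] * fullness_factor(terms, query_terms)
    return 2**32 * s[0] + 2**16 * s[1] + s[2]
```

The large factors act like a lexicographic ordering: a phrase matching all query terms (S_0) outweighs any realistic amount of partial matches in S_1 or S_2.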


13 2. Select & rate authorities
Select pages as targets if they are referenced by at least two of these experts
Rate them by calculating a target score:
1. Calculate an Edge_Score(E, T) for each edge (i.e., link) from any expert E to target page T:
   Edge_Score(E, T) = Expert_Score(E) × Σ_q occ(q, T)
   (sum over all query terms q, with occ(q, T) = number of different key phrases for T containing q)
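The edge score formula can be illustrated with a short function. It assumes the key phrases that expert E associates with target T are already available as lists of terms; that input layout and the function name are hypothetical.

```python
def edge_score(expert_score, query_terms, key_phrases_for_target):
    """Edge_Score(E, T) = Expert_Score(E) * sum over query terms q of occ(q, T).

    key_phrases_for_target: key phrases (each a list of terms) that the
    expert E associates with target T.
    occ(q, T) counts the distinct key phrases that contain q.
    """
    occ_total = 0
    for q in query_terms:
        distinct_phrases = {tuple(p) for p in key_phrases_for_target if q in p}
        occ_total += len(distinct_phrases)
    return expert_score * occ_total
```

For example, with Expert_Score(E) = 10, query {"jazz", "guitar"}, and the phrases "jazz guitar links" and "jazz lessons", occ is 2 for "jazz" and 1 for "guitar", giving an edge score of 30.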

14 2. Select & rate authorities (cont.)
1. Calculate an Edge_Score(E, T) for each edge (i.e., link) from any expert E to target page T
2. Check all experts pointing to the same target and, for affiliated experts, remove all edges but the one with the highest Edge_Score
3. The Target_Score is now calculated as the sum of all remaining Edge_Scores
Possible extension: Combine Target_Scores with a page-dependent Match_Score (depending on the appearance of search terms on the page)
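Steps 2 and 3 might be sketched as follows, assuming edge scores have already been computed and that each expert is mapped to an affiliation group. The input layout and names are assumptions; the min_experts=2 default mirrors the "at least two of these experts" rule from the previous slide and is applied before affiliated edges are collapsed.

```python
from collections import defaultdict


def target_scores(edges, group_of, min_experts=2):
    """Compute Target_Score for every target with enough experts.

    edges:    iterable of (expert, target, edge_score), scores >= 0
    group_of: expert -> affiliation group of the expert's host
    """
    best = defaultdict(dict)    # target -> group -> highest edge score
    experts = defaultdict(set)  # target -> experts pointing to it
    for expert, target, score in edges:
        group = group_of[expert]
        # Among affiliated experts keep only the edge with the highest score.
        best[target][group] = max(score, best[target].get(group, 0.0))
        experts[target].add(expert)
    return {
        target: sum(best[target].values())
        for target in best
        if len(experts[target]) >= min_experts
    }
```

Sorting the resulting dictionary by value in descending order would give the ranked list of authorities; combining it with a Match_Score, as mentioned in the possible extension, is not shown.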

15 Hilltop: Summary
Preprocessing:
- Divide the web into groups of affiliated pages (based on their authors / URLs)
- Select experts (based on linkage and groups)
Searching: Select and rate
1. Experts referencing pages about a particular topic (represented by the query)
2. Authorities for this particular topic

16 Hilltop: Discussion
Main properties (when compared to PageRank and HITS):
- Topic/query-dependent (unlike PageRank)
- Pre-selection of experts (unlike HITS), i.e.
  - all experts are considered (no subgraph)
  - efficient online calculation can be done
- Page content and structure are considered
Potential problems / criticism:
- Uses lots of intuitive assumptions that are modeled by heuristics

17 References
[1] Bharat, Mihaila: When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics. ACM Transactions on Information Systems, Vol. 20, No. 1, Jan. 2002

