Published by Audra Caldwell. Modified over 9 years ago.
1
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics
Meital Aizen
2
Outline:
o Introduction
    General idea
    Related work
o Hilltop algorithm
    Overview
    Algorithm phases
o Expert documents
    Detecting host affiliation
    Selecting experts
    Indexing the experts
o Query processing
    Computing expert score
    Computing target score
o Evaluation
o Conclusions
3
General Idea
Propose a ranking scheme for popular topics that places the most authoritative pages on the query topic at the top of the ranking.
4
Introduction
Queries on popular topics tend to produce a large result set. This set is hard to rank based on content alone, because content analysis cannot distinguish between authoritative and non-authoritative pages. Hence, other sources of information are used to rank results.
5
Related Work
Approaches taken in the past to improve the authoritativeness of ranked results:
o Ranking Based on Human Classification
o Ranking Based on Usage Information
o Ranking Based on Connectivity
6
Ranking Based on Human Classification
Human editors have been used by companies (such as Yahoo!) to manually associate a set of categories and keywords with a subset of documents on the web.
Disadvantages:
o Slow, and can only be applied to a small number of pages.
o Keywords and classifications are often inadequate or incomplete.
7
Ranking Based on Usage Information
Some services collect information on:
o Queries users submit to search services.
o Pages they look at subsequently and the time spent on each page.
This information is used to return the pages that most users visit after submitting the given query.
Disadvantages:
o A large amount of data needs to be collected for each query, so the set of queries this can be applied to is small.
o Open to spamming.
8
Ranking Based on Connectivity
Analyzes the hyperlinks between pages on the web, on the assumption that:
a) Pages on the topic link to each other.
b) Authoritative pages tend to point to other authoritative pages.
Two kinds of algorithms:
o PageRank
o Topic Distillation
9
PageRank
An algorithm that ranks pages based on assumption b. It computes a query-independent authority score for every page on the web and uses this score to rank the result set. It cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic.
10
Topic Distillation
Computes a query-specific subgraph of web pages, then gives every page in the subgraph an authority score. The subgraph is built as follows:
o A preliminary ranking for the query is done with content analysis.
o The top-ranked result pages for the query are selected. This creates a selected set.
o Some of the pages within one or two links of the selected set are also added to it if they are on the query topic.
11
Hilltop Algorithm Overview
12
How Does It Work?
“Expert documents” are a subset of pages on the web identified as directories of links to non-affiliated sources on specific topics. Results are ranked based on the match between the query and the relevant descriptive text for hyperlinks on expert pages that point to a given result page.
13
o List of experts: compute a list of the most relevant experts on the query topic.
o Identify target pages: identify relevant links in the expert set and follow them to get target pages.
o Rank targets: rank according to the number and relevance of non-affiliated experts.
14
Hilltop Algorithm Phases
15
Expert Lookup
What is an expert page? A page that is about a certain topic and has links to many non-affiliated pages on that topic. Two pages are non-affiliated if they are by authors from non-affiliated organizations.
16
o A subset of the pages crawled by a search engine is identified as experts.
o These pages are indexed in a special inverted index.
o Given an input query, a lookup is done on the expert index to find and rank matching expert pages.
17
Target Ranking
o Given the top-ranked matching expert pages and associated match information, we select links that we know to have all the query terms associated with them.
o With further connectivity analysis on the selected links, we identify a subset of their targets as the top-ranked pages on the query topic.
o The targets are ranked by a score that is computed by combining the scores of the experts pointing to the target.
A page is an authority on the query topic if and only if some of the best experts on that query topic point to it.
18
Expert Documents
19
What makes a page an expert?
An expert page needs to be objective and diverse. Its links should be unbiased and point to numerous non-affiliated pages on the subject.
20
Selecting the Experts
Process a search engine’s database of pages and select a subset considered to be good sources of links on specific topics. Expert pages are pages with out-degree greater than a threshold, k, whose URLs point to k distinct non-affiliated hosts.
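The selection rule can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function name `is_expert`, the `affiliation_lookup` mapping (host to affiliation-set identifier) and the default `k=5` are assumptions made for the example.

```python
from urllib.parse import urlparse


def is_expert(out_links, affiliation_lookup, k=5):
    """Return True if a page qualifies as an expert under the rule above.

    out_links: list of URLs found on the page.
    affiliation_lookup: hypothetical dict mapping a host to the identifier
        of its affiliation set; hosts missing from it map to themselves.
    k: threshold (the default value is an assumption for illustration).
    """
    # The out-degree must be greater than the threshold k.
    if len(out_links) <= k:
        return False
    # Count distinct affiliation groups among the linked hosts.
    groups = set()
    for url in out_links:
        host = urlparse(url).hostname
        if host is None:
            continue  # skip links without a host part
        groups.add(affiliation_lookup.get(host, host))
    # The links must reach at least k mutually non-affiliated hosts.
    return len(groups) >= k
```

Counting affiliation groups rather than raw hosts is what keeps a page that links only to many mirrors of one organization from qualifying.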
21
Detecting Host Affiliation
Two hosts are defined as affiliated if one or both of the following is true:
o They share the same first 3 octets of the IP address.
o The rightmost non-generic token in the hostname is the same.
For example, www.ibm.com and www.ibm.co.uk are affiliated: once the generic suffixes (.com and .co.uk) are set aside, the rightmost remaining token of both is "ibm".
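A minimal sketch of the pairwise test follows. The set of generic tokens is a small illustrative assumption (the slides do not list them), the helper names are hypothetical, and the IP addresses in the usage note come from documentation ranges.

```python
# Illustrative assumption: a tiny list of "generic" hostname tokens.
GENERIC_TOKENS = {"com", "org", "net", "edu", "gov", "co", "uk"}


def rightmost_non_generic_token(host):
    """Scan hostname tokens from the right and return the first non-generic one."""
    for token in reversed(host.lower().split(".")):
        if token not in GENERIC_TOKENS:
            return token
    return host.lower()  # every token was generic; fall back to the full name


def ip_prefix(ip):
    """Return the first 3 octets of a dotted IPv4 address."""
    return ".".join(ip.split(".")[:3])


def affiliated(host_a, ip_a, host_b, ip_b):
    """Two hosts are affiliated if either rule from the slide holds."""
    same_prefix = ip_prefix(ip_a) == ip_prefix(ip_b)
    same_token = (rightmost_non_generic_token(host_a)
                  == rightmost_non_generic_token(host_b))
    return same_prefix or same_token
```

For instance, `affiliated("www.ibm.com", "192.0.2.10", "www.ibm.co.uk", "198.51.100.7")` holds through the shared token "ibm", even though the IP prefixes differ.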
22
Using a union-find algorithm, we group into sets the hosts that either share the same rightmost non-generic suffix or have an IP address in common.
23
Every set is given a unique identifier. The host-affiliation lookup maps every host to its set identifier or to itself. If the lookup maps two hosts to the same value then they are affiliated; otherwise they are non-affiliated.
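One way to build such a lookup is sketched below, under stated assumptions: each host arrives with a precomputed pair (IP prefix, rightmost non-generic token), and the names `UnionFind` and `build_affiliation_lookup` are hypothetical. The root of each set serves as its identifier.

```python
class UnionFind:
    """Plain union-find (disjoint sets) with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def build_affiliation_lookup(host_keys):
    """host_keys: dict host -> (ip_prefix, name_token), assumed precomputed.

    Returns a dict mapping every host to the identifier of its set.
    """
    uf = UnionFind()
    for host, (prefix, token) in host_keys.items():
        # Tagged tuples keep IP keys and name keys apart from host names.
        uf.union(host, ("ip", prefix))
        uf.union(host, ("name", token))
    return {host: uf.find(host) for host in host_keys}
```

Because sets are merged transitively, a host that shares an IP prefix with one member of a set and nothing with the others still lands in that set.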
24
Indexing the Experts
To locate expert pages that match user queries, we create an inverted index that maps keywords to the experts on which they occur.
o Only text contained within the “key phrases” of an expert is indexed: the title, headings, and URL anchor text within the expert page.
o The inverted index is organized as a list of match positions within experts. Each match position corresponds to an occurrence of a certain keyword within a key phrase of a certain expert page.
o For every expert, we maintain the list of URLs within it, and for each URL we maintain the identifiers of the key phrases that qualify it.
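The index layout described above can be illustrated with a small sketch. The input shape, the whitespace tokenization and the (expert, phrase, offset) encoding of a match position are assumptions chosen for the example, not details given in the slides.

```python
from collections import defaultdict


def build_expert_index(experts):
    """Build a keyword -> match-positions index over expert key phrases.

    experts: dict expert_id -> list of (kind, text) key phrases, where kind
        is e.g. "title", "heading" or "anchor" (hypothetical labels).
    Each match position is (expert_id, phrase_id, word_offset).
    """
    index = defaultdict(list)
    for expert_id, phrases in experts.items():
        for phrase_id, (_kind, text) in enumerate(phrases):
            for offset, word in enumerate(text.lower().split()):
                index[word].append((expert_id, phrase_id, offset))
    return dict(index)
```

A lookup such as `build_expert_index(experts).get("linux", [])` then yields every key-phrase occurrence of the keyword across the expert set.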
25
Query Processing
26
In response to a user query, determine a list of N experts that are the most relevant for that query. Rank results by selectively following the relevant links from these experts and assigning an authority score to each page.
27
Computing the Expert Score
29
Computing the Target Score
Targets are the pages pointed to by the top N experts. The top-ranked documents are selected from this set of targets, which is ranked by Target_Score. To qualify, a target must be pointed to by at least 2 experts on hosts that are mutually non-affiliated and are not affiliated with the target.
31
Second Step: Check for affiliations between expert pages that point to the same target. If two affiliated experts have edges to the same target T, discard the edge with the lower Edge_Score of the two.
Third Step: To compute the Target_Score of a target, sum the Edge_Scores of all remaining edges incident on it.
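The second and third steps, together with the qualification rule for targets, can be sketched as follows. Edge_Score values are taken as given inputs, since their computation is not shown here. Ignoring edges from experts affiliated with the target is one interpretation of the qualification rule, and the function and parameter names are hypothetical.

```python
from collections import defaultdict


def target_scores(edges, affiliation):
    """Compute Target_Score per target from scored expert -> target edges.

    edges: list of (expert_host, target_host, edge_score); scores are inputs.
    affiliation: hypothetical dict host -> affiliation-set identifier;
        hosts missing from it map to themselves.
    """
    # target -> {expert affiliation group: best edge score from that group}
    best = defaultdict(dict)
    for expert, target, score in edges:
        group = affiliation.get(expert, expert)
        if group == affiliation.get(target, target):
            continue  # assumption: experts affiliated with the target do not count
        # Among affiliated experts, keep only the higher-scoring edge.
        if score > best[target].get(group, float("-inf")):
            best[target][group] = score
    # Require at least 2 mutually non-affiliated experts, then sum the edges.
    return {target: sum(groups.values())
            for target, groups in best.items() if len(groups) >= 2}
```

Ranking the targets is then a matter of sorting the returned dictionary by score in descending order.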
33
Evaluation
34
Evaluation
Two user studies were conducted in August 1999 to estimate recall and precision. Both experiments involved three commercial search engines for comparison: AltaVista, DirectHit and Google (marked as E1, E2, E3 to avoid controversy).
35
Locating Specific Popular Targets
Seven volunteers were asked to suggest the home pages of ten organizations of their choice. Some of the queries reproduced:
36
The same query was sent to the commercial search engines and to Hilltop. Every time the home page was found within the first ten results, its rank was recorded.
37
Average recall at rank k is the probability of finding the desired home page within the first k results.
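As a small worked sketch of this measure: assuming one recorded rank per query (1-based, or `None` when the home page was not in the first ten results), average recall at rank k is the fraction of queries whose rank is at most k. The function name and the sample ranks below are made up for illustration.

```python
def average_recall_at_k(ranks, k):
    """Fraction of queries whose desired home page appears within rank k.

    ranks: list with one entry per query; the 1-based rank at which the
        home page was found, or None if it was not in the first ten results.
    """
    if not ranks:
        raise ValueError("at least one query is required")
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return hits / len(ranks)


# Illustrative data only: four queries, one of them a miss.
sample_ranks = [1, 3, None, 7]
print(average_recall_at_k(sample_ranks, 1))   # 0.25
print(average_recall_at_k(sample_ranks, 10))  # 0.75
```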
38
Gathering Relevant Pages
The volunteers were asked to think of broad or popular topics and formulate queries. The 25 queries that were collected:
39
Each query was submitted to all four search engines, and the top 10 results were collected from each, recording the URL, rank and engine that found it. For each query, a list of unique URLs in the union of the results from all engines was generated. The list was presented to a judge in a random order, who rated each page for relevance to the given query on a binary scale. The ratings were combined with the information about source and rank and the average precision was computed at rank k (for k = 1, 5, and 10).
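The precision computation can be sketched as below, assuming that "average precision at rank k" means the mean over queries of the fraction of an engine's top k results judged relevant. Treating missing results as non-relevant (dividing by k) is also an assumption, and the function names are hypothetical.

```python
def precision_at_k(judgments, k):
    """Fraction of the top k results judged relevant.

    judgments: list of booleans for one query and one engine, in rank order.
    Results missing from a short list count as non-relevant (assumption).
    """
    return sum(1 for relevant in judgments[:k] if relevant) / k


def average_precision_at_k(per_query_judgments, k):
    """Mean of precision_at_k over all queries for one engine."""
    if not per_query_judgments:
        raise ValueError("at least one query is required")
    total = sum(precision_at_k(j, k) for j in per_query_judgments)
    return total / len(per_query_judgments)
```

Running this once per engine for k = 1, 5 and 10 gives the comparison described above.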
40
These results indicate that for broad subjects Hilltop returns a large percentage of highly relevant pages among its ten best-ranked pages.
41
Conclusions
Given a query, Hilltop generates a list of target pages which are likely to be very authoritative pages on the topic of the query. In computing the usefulness of a target page we only consider links originating from expert pages, which are directories of links pointing to many non-affiliated sites.
42
In computing the level of relevance, we require a match between the query and the text on the expert page which qualifies the hyperlink being considered. For further accuracy, we require that at least 2 non-affiliated experts point to the returned page, with relevant qualifying text describing their linkage. Hilltop delivers a high level of relevance given broad queries and performs comparably to the best of the commercial search engines tested.