1
Web Search – Summer Term 2006 VI. Web Search - Ranking (cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University
2
The Evolution of Search Engines
(Tutorial on Search from the Web to the Enterprise, SIGIR 2002)
1st generation: use only "on-page" text data
- Word frequency, language
- 1995-1997 (AltaVista, Excite, Lycos, etc.)
2nd generation: use off-page, web-specific data
- Link (or connectivity) analysis
- Click-through data (what results people click on)
- Anchor text (how people refer to a page)
- From 1998 (made popular by Google, but used by everyone now)
PageRank [2], introduced by Brin and Page, used by Google
HITS [3], introduced by Kleinberg (used by Teoma?)
3
Link-based ranking: HITS
Motivation (compare PageRank):
- Broad-topic queries deliver a (too) large set of relevant results
- Therefore: rank based on the authority of a web page (cf. PageRank: quality / importance)
Link: interpreted as a conferral of authority
Goal: find pages with high authority (balance between relevance and popularity)
4
Link-based ranking: HITS (cont.)
Basic idea:
- Consider a sub-graph of the web graph that contains as many relevant pages as possible
- Analyze the graph's link structure to find:
  Authorities = the most authoritative or definitive subset of relevant pages (for ranking)
  Hubs = pages pointing to many related authorities (for their identification)
5
Authorities and Hubs - Example
Example: Query “search engine”
AUTHORITIES:
- www.google.com
- www.teoma.com
- www.alltheweb.com
- www.altavista.com
HUBS:
- dir.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/Searching_the_Web/Search_Engines_and_Directories/
- searchenginewatch.com
6
Authorities and Hubs - Basic idea
Approach:
- Generate a query-dependent sub-graph
- Recursively calculate hubs and authorities
Assume S is the set of pages in this sub-graph. Then S should
- be rather small
- contain lots of relevant pages
- contain the most important authorities
Basic idea to generate such a sub-graph:
- Get an initial root set based on any IR criteria
- Include the local neighborhood of this set
7
Authorities and Hubs - Base set
GIVEN:
- QUERY Q
- TEXT-BASED SEARCH ENGINE SE
- CONSTANTS T AND D (NAT. NUMBERS)
- SET R(Q) OF THE FIRST T RESULTS OF SE GIVEN Q
ALGORITHM TO CALCULATE SUBGRAPH S(Q):
S(Q) := R(Q)
FOR EACH PAGE P IN R(Q)
    T+(P) := SET OF PAGES LINKED BY P
    T-(P) := SET OF PAGES LINKING TO P
    ADD ALL PAGES FROM T+(P) TO S(Q)
    IF |T-(P)| < D
        THEN ADD ALL PAGES FROM T-(P) TO S(Q)
        ELSE ADD RANDOM SUBSET OF T-(P) TO S(Q)
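The base-set construction above can be sketched in Python as follows. This is a minimal sketch, not the original implementation: the function name `build_base_set`, the dictionary-based link maps, and the `rng` parameter are assumptions for illustration. The slide does not say how large the random subset is; the sketch assumes it has exactly D elements, as in Kleinberg's paper.

```python
import random


def build_base_set(root_set, out_links, in_links, d, rng=None):
    """Expand a root set R(Q) into the base set S(Q).

    root_set:  list of page ids (the first T results of the text search engine)
    out_links: dict page -> set of pages it links to    (T+(p))
    in_links:  dict page -> set of pages linking to it  (T-(p))
    d:         threshold on the number of in-linking pages per root page
    rng:       optional random.Random instance (hypothetical parameter,
               only here to make the sketch reproducible)
    """
    rng = rng or random.Random(0)
    base = set(root_set)
    for p in root_set:
        # Add every page that p links to.
        base |= out_links.get(p, set())
        preds = in_links.get(p, set())
        if len(preds) < d:
            # Few in-links: add all of them.
            base |= preds
        else:
            # Assumption: the random subset has exactly d elements.
            base |= set(rng.sample(sorted(preds), d))
    return base
```

Capping the in-links keeps the base set small even when a root page is linked from a very large number of pages.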
8
Query-dependent base set - Comments
Why only use a sub-graph?
- Advantage of query dependence
- Reduces processing time (online calculation!)
Why not just take the root set?
- Appearance of query terms does not necessarily represent relevance (or authority)
- Larger network is needed for link analysis
In original work: heuristics for special cases
- Remove intrinsic links, i.e. links from the same domain (navigational links, etc.)
- Consider only a certain number of links from one domain to a page p (to avoid spamming)
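The two link heuristics could look roughly like this in Python. This is a sketch under stated assumptions: the helper names `host` and `filter_links` and the cap parameter `m` are hypothetical, and the URL's hostname is used as a simple stand-in for "domain".

```python
from urllib.parse import urlsplit


def host(url):
    """Lower-cased hostname of a URL; a simplification of 'domain'."""
    return (urlsplit(url).hostname or "").lower()


def filter_links(links, m):
    """Filter a list of (source_url, target_url) pairs.

    Drops intrinsic links (same host on both ends) and keeps at most m
    links from any single host to a given target page.
    """
    kept = []
    count = {}  # (source host, target url) -> number of links kept so far
    for src, dst in links:
        if host(src) == host(dst):
            continue  # intrinsic link, e.g. site navigation
        key = (host(src), dst)
        if count.get(key, 0) >= m:
            continue  # this host already contributed m links to dst
        count[key] = count.get(key, 0) + 1
        kept.append((src, dst))
    return kept
```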
9
Calculating Hubs and Authorities
Obviously, there is a mutually reinforcing relationship between hubs and authorities:
- A good hub links to many good authorities
- A good authority is linked to by many hubs
Hence, use an iterative algorithm to estimate a hub value and an authority value for each page.
Hubs: O-Operation
Authorities: I-Operation
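In Kleinberg's formulation, the two operations can be written as below. The notation is an assumption taken from that paper rather than from the slides: x_p is the authority weight of page p, y_p its hub weight, and E the set of links (q, p) in the sub-graph.

```latex
\text{I-operation:}\quad x_p \leftarrow \sum_{q \,:\, (q,p) \in E} y_q
\qquad\qquad
\text{O-operation:}\quad y_p \leftarrow \sum_{q \,:\, (p,q) \in E} x_q
```

The I-operation sums the hub weights of all pages linking to p; the O-operation sums the authority weights of all pages that p links to.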
10
Calculating Hubs and Authorities
[Figure: the two update operations, each drawn for a page p and pages q1, q2, q3. Hubs: O-Operation; Authorities: I-Operation.]
11
Calculating Hubs and Authorities
GIVEN:
- SUB-GRAPH G WITH N PAGES (FROM BASE SET S(Q))
- CONSTANT NUMBER K
ALGORITHM TO CALCULATE HUBS AND AUTHORITIES:
X0 := (1, 1,..., 1)
Y0 := (1, 1,..., 1)
FOR i = 1,..., K
    CALCULATE NEW WEIGHTS Xi BY APPLYING THE I-OPERATION TO Xi-1, Yi-1
    CALCULATE NEW WEIGHTS Yi BY APPLYING THE O-OPERATION TO Xi, Yi-1
    NORMALIZE Xi AND Yi
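The iteration above can be sketched in plain Python. This is a minimal sketch with some assumptions: the graph is given as a list of pages plus (source, target) link pairs, the names `hits` and `normalize` are hypothetical, and weights are normalized to unit Euclidean length (the slide does not say which norm is used).

```python
import math


def normalize(w):
    """Scale a weight dict to unit Euclidean length (assumed norm)."""
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {p: (v / norm if norm else 0.0) for p, v in w.items()}


def hits(pages, edges, k):
    """Run k rounds of the I- and O-operations.

    pages: list of page ids; edges: iterable of (source, target) pairs.
    Returns (x, y): authority weights x and hub weights y as dicts.
    """
    edges = list(edges)
    x = {p: 1.0 for p in pages}
    y = {p: 1.0 for p in pages}
    for _ in range(k):
        # I-operation: authority weight = sum of hub weights of in-linking pages.
        new_x = {p: 0.0 for p in pages}
        for q, p in edges:
            new_x[p] += y[q]
        # O-operation: hub weight = sum of the new authority weights of linked pages.
        new_y = {p: 0.0 for p in pages}
        for p, q in edges:
            new_y[p] += new_x[q]
        x, y = normalize(new_x), normalize(new_y)
    return x, y
```

For example, `hits(["h1", "h2", "a1", "a2"], [("h1", "a1"), ("h1", "a2"), ("h2", "a1")], 20)` gives `a1` a higher authority weight than `a2`, and `h1` a higher hub weight than `h2`.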
12
Calculating Hubs and Authorities Convergence: see lit. Basic idea:
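One common way to state the basic idea, sketched here with notation that is not from the slides: let A be the adjacency matrix of G, with A_pq = 1 if page p links to page q, and let c_i, c_i' be the normalization constants. The I- and O-operations are then matrix-vector products, and the iteration is a power iteration:

```latex
x_i = c_i\, A^{\top} y_{i-1}, \qquad y_i = c_i'\, A\, x_i
\quad\Longrightarrow\quad
x_i \propto (A^{\top}A)^{\,i-1} A^{\top}\mathbf{1}, \qquad
y_i \propto (A A^{\top})^{\,i}\,\mathbf{1}
```

Under the usual conditions for power iteration (a unique dominant eigenvalue and a start vector that is not orthogonal to the corresponding eigenvector), the authority weights converge to the principal eigenvector of A^T A and the hub weights to the principal eigenvector of A A^T.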
13
PageRank vs. HITS
(Tutorial on Search from the Web to the Enterprise, SIGIR 2002)
PageRank
+ Hard to spam
+ Computes quality signal for all pages
- Non-trivial to compute
- Not query specific
- Does not work on small graphs
Proven to be effective for general purpose ranking
HITS
+ Easy to compute, real-time execution is hard
+ Query specific
+ Works on small graphs
- Local graph structure can be manufactured
- Provides a signal only when there is direct connectivity (e.g. home pages)
Well suited for supervised directory construction
14
Commercial search engines using HITS
(Maybe?) Teoma, now search.ask.com
"Teoma's underlying technology is an extension of the HITS algorithm …", C. Sherman, April 2002, http://dc.internet.com/news/article.php/1002061 (not online anymore)
15
References - HITS
[1] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998
[2] Jon Kleinberg: "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM, Vol. 46, No. 5, September 1999
16
General Web Search Engine Architecture
[Figure (cf. [1] Fig. 1): Client, Query Engine, Ranking, Crawl Control, Crawler(s), WWW, Collection Analysis Module, Indexer Module, Page Repository, Indexes (Structure, Utility, Text); flows labeled Queries, Results, Usage Feedback.]