Web Search – Summer Term 2006 VI. Web Search - Ranking (cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University.



The Evolution of Search Engines
(Source: Tutorial on Search from the Web to the Enterprise, SIGIR 2002)
1st generation: Use only "on-page" text data
- Word frequency, language (AltaVista, Excite, Lycos, etc.)
2nd generation: Use off-page, web-specific data
- Link (or connectivity) analysis
- Click-through data (what results people click on)
- Anchor text (how people refer to a page)
From 1998 (made popular by Google, but used by everyone now):
- PageRank [2], introduced by Brin and Page, used by Google
- HITS [3], introduced by Kleinberg (used by Teoma?)

Link-based ranking: HITS
Motivation (compare PageRank):
- Broad-topic queries deliver a (too) large set of relevant results
- Therefore: rank based on the authority of a web page (cf. PageRank: quality / importance)
- Link: interpreted as a conferral of authority
Goal: Find pages with high authority (balance between relevance and popularity)

Link-based ranking: HITS (cont.)
Basic idea:
- Consider a sub-graph of the web graph that contains as many relevant pages as possible
- Analyze the graph's link structure to find:
  - Authorities = the most authoritative or definitive subset of relevant pages (for ranking)
  - Hubs = pages pointing to many related authorities (for their identification)

Authorities and Hubs - Example
Example: Query "search engine"
AUTHORITIES / HUBS:
- dir.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/Searching_the_Web/Search_Engines_and_Directories/
- searchenginewatch.com

Authorities and Hubs - Basic idea
Approach:
- Generate a query-dependent sub-graph
- Recursively calculate hubs and authorities
Assume S is the set of pages in this sub-graph. Then S should:
- be rather small
- contain many relevant pages
- contain the most important authorities
Basic idea to generate such a sub-graph:
- Get an initial root set based on any IR criteria
- Include the local neighborhood of this set

Authorities and Hubs - Base set
GIVEN:
- QUERY Q
- TEXT-BASED SEARCH ENGINE SE
- CONSTANTS T AND D (NATURAL NUMBERS)
- SET R(Q) OF THE FIRST T RESULTS OF SE GIVEN Q
ALGORITHM TO CALCULATE SUBGRAPH S(Q):
  S(Q) := R(Q)
  FOR EACH PAGE P IN R(Q)
    T+(P) := SET OF PAGES LINKED BY P
    T-(P) := SET OF PAGES LINKING TO P
    ADD ALL PAGES FROM T+(P) TO S(Q)
    IF |T-(P)| < D
      THEN ADD ALL PAGES FROM T-(P) TO S(Q)
      ELSE ADD A RANDOM SUBSET OF D PAGES FROM T-(P) TO S(Q)
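The base-set construction above can be sketched in Python as follows. This is a minimal illustration, not the original implementation: the names `build_base_set`, `out_links`, `in_links` and the toy graph are hypothetical, and sampling exactly d in-linking pages in the else-branch is an assumption taken from Kleinberg's paper.

```python
import random
from typing import Callable, Dict, Iterable, Optional, Set


def build_base_set(
    root_set: Iterable[str],
    out_links: Callable[[str], Set[str]],
    in_links: Callable[[str], Set[str]],
    d: int,
    rng: Optional[random.Random] = None,
) -> Set[str]:
    """Expand the root set R(Q) into the base set S(Q)."""
    rng = rng or random.Random()
    roots = list(root_set)
    base = set(roots)                      # S(Q) := R(Q)
    for page in roots:
        base |= out_links(page)            # add all pages linked by the page
        preds = in_links(page)             # pages linking to the page
        if len(preds) < d:
            base |= preds
        else:
            # Assumption: keep exactly d randomly chosen in-linking pages.
            base |= set(rng.sample(sorted(preds), d))
    return base


# Hypothetical toy link graph: page -> pages it links to.
TOY_LINKS: Dict[str, Set[str]] = {
    "a": {"b", "c"},
    "b": {"c"},
    "c": set(),
    "x": {"a"},
    "y": {"a"},
    "z": {"a"},
}


def toy_out_links(page: str) -> Set[str]:
    return set(TOY_LINKS.get(page, set()))


def toy_in_links(page: str) -> Set[str]:
    return {src for src, targets in TOY_LINKS.items() if page in targets}


base = build_base_set(["a"], toy_out_links, toy_in_links, d=2, rng=random.Random(0))
print(sorted(base))
```

In the toy graph, page "a" has three in-linking pages, so with d = 2 only two of them enter the base set. This cap is what keeps the sub-graph small for popular pages.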

Query-dependent base set - Comments
Why only use a sub-graph?
- Advantage of query dependence
- Reduces processing time (online calculation!)
Why not just take the root set?
- Appearance of query terms does not necessarily represent relevance (or authority)
- A larger network is needed for link analysis
In the original work: heuristics for special cases
- Remove intrinsic links, i.e. links between pages of the same domain (navigational links, etc.)
- Consider only a certain number of links from one domain to a page p (to avoid spamming)
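The two link heuristics can be illustrated with a short Python sketch. It rests on assumptions not stated on the slide: links are given as (source URL, target URL) pairs with a scheme, "same domain" is approximated by equal host names, and `filter_links` and `max_per_domain` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlsplit


def host(url: str) -> str:
    """Host name of a URL (assumes the URL carries a scheme such as http://)."""
    return urlsplit(url).hostname or ""


def filter_links(
    links: Iterable[Tuple[str, str]], max_per_domain: int
) -> List[Tuple[str, str]]:
    """Drop intrinsic links and cap the links from one host to one target page."""
    kept: List[Tuple[str, str]] = []
    seen: Dict[Tuple[str, str], int] = defaultdict(int)
    for src, dst in links:
        src_host = host(src)
        if src_host == host(dst):
            continue                       # intrinsic (e.g. navigational) link
        key = (src_host, dst)
        if seen[key] >= max_per_domain:
            continue                       # this host already endorsed dst often enough
        seen[key] += 1
        kept.append((src, dst))
    return kept


example_links = [
    ("http://a.example/1", "http://a.example/2"),   # intrinsic, dropped
    ("http://b.example/1", "http://a.example/2"),
    ("http://b.example/2", "http://a.example/2"),
    ("http://b.example/3", "http://a.example/2"),   # over the cap of 2, dropped
    ("http://c.example/", "http://a.example/2"),
]
filtered = filter_links(example_links, max_per_domain=2)
```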

Calculating Hubs and Authorities
There is a mutually reinforcing relationship between hubs and authorities:
- A good hub links to many good authorities
- A good authority is linked to by many good hubs
Hence, use an iterative algorithm to estimate a hub value and an authority value for each page:
- Hubs: O-operation
- Authorities: I-operation

Calculating Hubs and Authorities
[Figure: O-operation (hubs) and I-operation (authorities), each drawn for a page p and neighboring pages q1, q2, q3]
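In the notation of Kleinberg [2], the two operations can be written as below. The symbols are assumptions taken from that paper rather than from the slide: x_p is the authority weight of page p, y_p its hub weight, and E the set of links in the sub-graph.

```latex
% I-operation (authorities): sum the hub weights of all pages q that link to p
x_p \;\leftarrow\; \sum_{q \,:\, (q,p) \in E} y_q

% O-operation (hubs): sum the authority weights of all pages q that p links to
y_p \;\leftarrow\; \sum_{q \,:\, (p,q) \in E} x_q
```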

Calculating Hubs and Authorities
GIVEN:
- SUB-GRAPH G WITH N PAGES (FROM BASE SET S(Q))
- CONSTANT NUMBER K
ALGORITHM TO CALCULATE HUBS AND AUTHORITIES:
  X0 := (1, 1, ..., 1)
  Y0 := (1, 1, ..., 1)
  FOR i = 1, ..., K
    CALCULATE NEW WEIGHTS Xi BY APPLYING THE I-OPERATION TO Xi-1, Yi-1
    CALCULATE NEW WEIGHTS Yi BY APPLYING THE O-OPERATION TO Xi, Yi-1
    NORMALIZE Xi AND Yi
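A minimal Python sketch of this iteration is given below. It assumes the sub-graph is a dictionary mapping each page to the set of pages it links to, and it normalizes to unit Euclidean length as in Kleinberg [2]. The slide only says "normalize", so that choice, the function names and the toy graph are all illustrative assumptions.

```python
import math
from typing import Dict, Set, Tuple


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale the weights so that their squares sum to 1 (assumed L2 normalization)."""
    norm = math.sqrt(sum(v * v for v in weights.values()))
    if norm == 0.0:
        return dict(weights)
    return {page: v / norm for page, v in weights.items()}


def hits(links: Dict[str, Set[str]], k: int) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Run k rounds of the I- and O-operations; returns (authority, hub) weights."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = {p: 1.0 for p in pages}         # X0 := (1, ..., 1)
    hub = {p: 1.0 for p in pages}          # Y0 := (1, ..., 1)
    for _ in range(k):
        # I-operation: authority of q = sum of the previous hub weights of pages linking to q
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # O-operation: hub of p = sum of the new authority weights of pages p links to
        new_hub = {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub


# Hypothetical toy graph: h1 links to both a1 and a2, h2 only to a1.
toy_graph = {"h1": {"a1", "a2"}, "h2": {"a1"}, "a1": set(), "a2": set()}
authority, hubs = hits(toy_graph, k=20)
```

In the toy graph, a1 ends up with the highest authority weight because both hubs link to it, and h1 is the stronger hub because it links to both authorities.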

Calculating Hubs and Authorities
Convergence: see the literature.
Basic idea:
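A sketch of the standard argument, following Kleinberg [2], is given below. The notation is an assumption introduced here: A is the adjacency matrix of the sub-graph (A_{pq} = 1 if page p links to page q), x and y are the authority and hub weight vectors, and z = (1, ..., 1) is the initial vector.

```latex
% One round of the iteration, up to normalization:
x_i \;\propto\; A^{\mathsf T} y_{i-1}, \qquad y_i \;\propto\; A\, x_i

% Unrolling k rounds from y_0 = z:
x_k \;\propto\; (A^{\mathsf T} A)^{k-1} A^{\mathsf T} z, \qquad
y_k \;\propto\; (A A^{\mathsf T})^{k} z
```

This is a power iteration. Under the usual conditions (a unique largest eigenvalue and a start vector that is not orthogonal to its eigenvector), the authority weights tend to the principal eigenvector of A^T A and the hub weights to the principal eigenvector of A A^T.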

PageRank vs. HITS
(Source: Tutorial on Search from the Web to the Enterprise, SIGIR 2002)

PageRank
+ Hard to spam
+ Computes quality signal for all pages
- Non-trivial to compute
- Not query specific
- Does not work on small graphs
Proven to be effective for general purpose ranking

HITS
+ Easy to compute, but real-time execution is hard
+ Query specific
+ Works on small graphs
- Local graph structure can be manufactured
- Provides a signal only when there is direct connectivity (e.g. home pages)
Well suited for supervised directory construction

Commercial search engines using HITS
(Maybe?) Teoma, now search.ask.com
"Teoma's underlying technology is an extension of the HITS algorithm …", C. Sherman, April 2002 (not online anymore)

References - HITS
[1] S. Brin, L. Page: "The Anatomy of a Large-Scale Hypertextual Web Search Engine", WWW 1998
[2] Jon Kleinberg: "Authoritative Sources in a Hyperlinked Environment", Journal of the ACM, Vol. 46, No. 5, September 1999

General Web Search Engine Architecture
[Figure: architecture diagram with the components client, query engine, ranking, crawl control, crawler(s), page repository, collection analysis module, indexer module, and indexes (structure, utility, text); labeled flows: queries, results, usage feedback, WWW (cf. [1], Fig. 1)]