Published by Lenard Grant Scott. Modified over 9 years ago.
1
Hypersearching the Web, Chakrabarti, Soumen Presented By Ray Yamada
2
Overview Why Do We Care? Purpose of The Paper? Solution by Clever Project Pros / Cons of the Paper Further Research
3
Why Do We Care? Web Link Analysis is crucial for efficient Crawling and Ranking algorithms Crawling: Google Sitemap Submission, Yahoo Directory Ranking: Relevant Result
4
Purpose of The Paper? To Overcome These Challenges: –Its Size & Growth –Its Content Types –Language Semantics –New Language –Staleness of Results –SPAM –And More…
5
Solution: Hyperlinks, Hyperlinks, Hyperlinks… Can Think of the Web as a Directed Graph Node = Web page (URL) Edge = Hyperlink
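The directed-graph view can be made concrete with a small adjacency list. This is a minimal sketch: the page names are hypothetical placeholders, and `in_links` is a helper name introduced here for illustration, not something from the paper.

```python
# Hypothetical pages and hyperlinks, for illustration only.
# Each key is a node (web page); each list holds its outgoing edges.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": [],
}


def in_links(graph):
    """Invert the graph: map each page to the pages that link to it."""
    incoming = {page: [] for page in graph}
    for source, targets in graph.items():
        for target in targets:
            incoming.setdefault(target, []).append(source)
    return incoming


incoming = in_links(links)
```

Both directions matter for link analysis: outgoing edges describe what a page recommends, while incoming edges describe who endorses it.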
6
Solution: HITS Algorithm
Hyperlink-Induced Topic Search (HITS)
–A.k.a. Hubs and Authorities
Hubs – Highly-valued lists for a given query
–Ex. Yahoo Directory, Open Directory Project, and bookmarking sites
Authorities – Highly endorsed answers to the query
–Ex. New York Times, Huffington Post, Twitter
It is possible for a webpage to be both a Hub and an Authority
–Ex. Restaurant review blogs
7
Solution: HITS Algorithm Cont…
For each page p, we assign two values: hub(p) and auth(p).
Initial Value: For all p, hub(p) = 1 and auth(p) = 1 (or any predetermined number).
Authority Update Rule: For each page p, update auth(p) to be the sum of the hub scores of all pages that point to it.
Hub Update Rule: For each page p, update hub(p) to be the sum of the authority scores of all pages that it points to.
Normalize and repeat.
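The update rules can be sketched in a few lines of Python. This is an illustrative sketch, not the Clever Project's implementation: the function names, the fixed iteration count, and the choice to normalize scores so they sum to 1 are assumptions made here (Kleinberg's paper normalizes by the sum of squares instead).

```python
def normalize(scores):
    """Scale scores so they sum to 1 (assumed normalization)."""
    total = sum(scores.values())
    if total == 0:
        return dict(scores)
    return {page: value / total for page, value in scores.items()}


def hits(graph, iterations=20):
    """Run the hub/authority updates on a dict of page -> linked pages."""
    pages = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub update: sum the authority scores of pages p points to.
        new_hub = {p: sum(new_auth[t] for t in graph.get(p, []))
                   for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical toy graph: two list-like pages pointing at two answers.
example = {"h1": ["a", "b"], "h2": ["a"], "a": [], "b": []}
hub, auth = hits(example)
```

In the toy graph, "a" ends up with the higher authority score because both hubs point to it, and "h1" ends up as the stronger hub because it points to both authorities.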
8
Solution: HITS Algorithm Cont…
Calculation

Hub(p)    Num of Links    Raw Score
0.249     3               0.747
0.321     4               1.284
0.181     2               0.362
0.123     2               0.246
0.088     2               0.176
0.015     1
0.018     2               0.036
0.003     1
Sum: 1.00                 2.872

Authority Pages (q)    Raw Score    Auth(q)
SJ Merc News           0.57         0.198
Wall St. Journal       0.57         0.198
New York Times         0.874        0.304
USA Today              0.59         0.205
Facebook               0.123        0.043
Yahoo!                 0.121        0.042
Amazon                 0.024        0.008
Sum:                                1.000
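The normalization step on this slide can be checked by hand. The sketch below assumes that each Auth(q) value is the page's raw score divided by the sum of all raw authority scores; the raw scores are copied from the slide, and the variable names are introduced here for illustration.

```python
# Raw authority scores as listed on the slide.
raw_auth = {
    "SJ Merc News": 0.57,
    "Wall St. Journal": 0.57,
    "New York Times": 0.874,
    "USA Today": 0.59,
    "Facebook": 0.123,
    "Yahoo!": 0.121,
    "Amazon": 0.024,
}

# Assumed normalization: divide each raw score by the total.
total = sum(raw_auth.values())
auth = {page: round(score / total, 3) for page, score in raw_auth.items()}

for page, score in auth.items():
    print(f"{page:<18} {score:.3f}")
```

Under that assumption the total comes to 2.872, and the rounded results match the Auth(q) column on the slide.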
9
Pros: –Accurately addresses concerns and challenges we currently deal with –Great introduction to search engine algorithm –Briefly covered many topics (Breadth)
10
Cons: –Some materials are out of date (1999) –Ex. Google vs. Clever Project –Lack of Depth –Ex. Normalization of Hub and Auth values
11
Further Research: HITS Algorithm – Extreme Cases Large-in-small-out sites –High Auth(p) –No Problem Small-in-large-out sites –High Hub(p) –Problem
12
Further Research: HITS + Relevance Scoring Method Vector Space Model (VSM) –Documents and queries are represented by vectors –Term Frequency Okapi Measurement –Term Frequency + Document Length Cover Density Ranking (CDR) –Phrase Similarity (How close terms appear)
13
Further Research: HITS + Relevance Scoring Method Use Cosine Relevance Test [Figure: axes labeled Price and Car]
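A cosine relevance test in the vector space model can be sketched as follows. This is a simplified illustration using raw term counts as the vector weights; the function name and the whitespace tokenization are assumptions made here, and the cited work may weight terms differently.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the terms the two texts share.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# A query with the two terms from the slide, against sample documents.
same = cosine_similarity("car price", "car price")
partial = cosine_similarity("car price", "car review")
unrelated = cosine_similarity("car price", "weather forecast")
```

A score near 1 means the document uses the query terms in the same proportions; a score of 0 means the two share no terms.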
14
Further Research: HITS + Relevance Scoring Method
Three-Level Scoring Method (TLS)
–Manual evaluation of relevance
Relevant links = 2 points
Slightly relevant links = 1 point
Inactive links + error links (404, 603) = 0 points
Irrelevant links = 0 points
–Order of query terms matters
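The point scheme can be tallied with a small lookup table. This sketch assumes that a result list's score is simply the sum of the points for its manually assigned labels; the label strings and the function name are hypothetical and chosen here for illustration.

```python
# Points per manual relevance label, following the slide.
POINTS = {
    "relevant": 2,
    "slightly_relevant": 1,
    "inactive": 0,
    "error": 0,
    "irrelevant": 0,
}


def tls_score(labels):
    """Sum the points for a list of manually assigned labels."""
    return sum(POINTS[label] for label in labels)


# Hypothetical judgments for the top four results of one query.
judged = ["relevant", "slightly_relevant", "error", "irrelevant"]
score = tls_score(judged)
```

Because the labels come from human judges, the score is only as consistent as the judging guidelines behind it.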
15
Further Research: Co-citation Graph Regular Link Graph: Co-citation Graph:
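A co-citation graph can be derived from the regular link graph: two pages are co-cited whenever a third page links to both. The sketch below counts those pairs; the page names are hypothetical and the function name is introduced here for illustration.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(graph):
    """Count, for each pair of pages, how many pages link to both."""
    counts = Counter()
    for source, targets in graph.items():
        # Each unordered pair of distinct targets is co-cited by source.
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical link graph: p and q are the citing pages.
link_graph = {"p": ["x", "y"], "q": ["x", "y", "z"]}
cocited = cocitation_counts(link_graph)
```

Each counted pair becomes an undirected edge in the co-citation graph, and the count can serve as the edge weight.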
16
What’s Next?
Google’s New Search Index: Caffeine
–Announced June 8th, 2010
–Up to 50% fresher results
–Twice as fast
Real Time Search
–Twitter / Facebook
http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html
17
References
Chakrabarti, Soumen; Dom, Byron; Kumar, S. Ravi; Raghavan, Prabhakar; Rajagopalan, Sridhar; Tomkins, Andrew (1999). "Hypersearching the Web". Scientific American, June 1999.
Li, Longzhuang; Shang, Yi; Zhang, Wei (2002). "Improvement of HITS-based algorithms on web documents". Proceedings of the 11th International Conference on World Wide Web, May 07-11, 2002, Honolulu, Hawaii, USA. doi:10.1145/511446.511514.
Henzinger, M. (2001). "Hyperlink analysis for the Web". IEEE Internet Computing, 5(1), 45-50.
Kleinberg, Jon (1999). "Authoritative sources in a hyperlinked environment" (PDF). Journal of the ACM 46 (5): 604–632. doi:10.1145/324133.324140.
von Ahn, Luis (2008-10-19). "Hubs and Authorities" (PDF). 15-396: Science of the Web Course Notes. Carnegie Mellon University. Retrieved 2008-11-09.
18
Q & A