Hypersearching the Web, Chakrabarti, Soumen Presented By Ray Yamada
Overview Why Do We Care? Purpose of The Paper? Solution by Clever Project Pros / Cons of the Paper Further Research
Why Do We Care? Web Link Analysis is crucial for efficient Crawling and Ranking algorithms Crawling: Google Sitemap Submission, Yahoo Directory Ranking: Relevant Result
Purpose of The Paper? To Overcome These Challenges: –Its Size & Growth –Its Content Types –Language Semantics –New Language –Staleness of Results –SPAM –And More…
Solution: Hyperlinks, Hyperlinks, Hyperlinks… Can Think of the Web as a Directed Graph Node = Web page (URL) Edge = Hyperlink
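A minimal sketch of the directed-graph view, assuming a plain adjacency mapping. The page names are hypothetical and not taken from the paper; each key is a node and each listed target is one outgoing hyperlink.

```python
# Hypothetical pages: each key links to the pages in its list.
web_graph = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": [],
}


def in_links(graph):
    """Reverse the edges: map each page to the pages that link to it."""
    incoming = {page: [] for page in graph}
    for source, targets in graph.items():
        for target in targets:
            incoming.setdefault(target, []).append(source)
    return incoming


incoming = in_links(web_graph)
```

Keeping both directions at hand is convenient because link analysis needs the pages a node points to as well as the pages that point to it.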
Solution: HITS Algorithm Hyperlink-Induced Topic Search (HITS) –A.k.a. Hubs and Authorities Hubs – Highly-valued lists for a given query –Ex. Yahoo Directory, Open Directory Project and Bookmarking sites. Authorities – Highly endorsed answers to the query –Ex. New York Times, Huffington Post, Twitter It is possible for a webpage to be both Hub and Authority –Ex. Restaurant Review Blogs
Solution: HITS Algorithm Cont… For each page p, we assign two values, hub(p) and auth(p). Initial Value: For all p, hub(p) = 1 and auth(p) = 1 (or any predetermined number). Authority Update Rule: For each page p, update auth(p) to be the sum of the hub scores of all pages that point to it. Hub Update Rule: For each page p, update hub(p) to be the sum of the authority scores of all pages that it points to. Normalize and Repeat.
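A minimal sketch of the update loop, not the Clever project's implementation. It assumes the graph is a mapping from each page to the pages it links to, that authority scores sum the hub scores of in-linking pages, and that hub scores sum the authority scores of linked-to pages, as in Kleinberg (1999). The slide does not say how to normalize, so dividing by the Euclidean norm is an assumption here, as are the fixed iteration count and the example page names.

```python
import math


def normalize(scores):
    """Scale scores so their squares sum to 1 (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(graph, iterations=20):
    """Return (hub, auth) score dicts for a page -> linked-pages mapping."""
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub update: sum the authority scores of pages that p points to.
        new_hub = {p: sum(new_auth[t] for t in graph.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical example: two list-like pages pointing at two answer pages.
example = {"hub1": ["a", "b"], "hub2": ["a"]}
hub, auth = hits(example)
```

In this toy graph, "a" ends up with the highest authority score because both list pages link to it, and "hub1" ends up with the highest hub score because it links to both answer pages.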
Solution: HITS Algorithm Cont… Calculation
Hub table columns: Hub(p) | Num of Links | Raw Score (with Sum row)
Authority table columns: Authority Pages (q) | Raw Score | Auth(q) (with Sum row)
Pages listed: SJ Merc News, Wall St. Journal, New York Times, USA Today, Facebook, Yahoo!, Amazon
Pros: –Accurately addresses concerns and challenges we currently deal with –Great introduction to search engine algorithm –Briefly covered many topics (Breadth)
Cons: –Some materials are out of date (1999) –Ex. Google vs. Clever Project –Lack of Depth –Ex. Normalization of Hub and Auth values
Further Research: HITS Algorithm – Extreme Cases Large-in-small-out sites –High Auth(p) –No Problem Small-in-large-out sites –High Hub(p) –Problem
Further Research: HITS + Relevance Scoring Method Vector Space Model (VSM) –Documents and queries are represented by vectors –Term Frequency Okapi Measurement –Term Frequency + Document Length Cover Density Ranking (CDR) –Phrase Similarity (How close terms appear)
Further Research: HITS + Relevance Scoring Method Use Cosine Relevance Test [Figure labels: Price, Car]
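A minimal sketch of a cosine relevance test, assuming raw term-frequency vectors built from whitespace-split, lowercased text. The function name and the sample query and document are hypothetical; the weighting used in the cited work may differ.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Counter returns 0 for terms missing from the other text.
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and document.
similarity = cosine_similarity("car price", "used car price list")
```

A score near 1 means the two texts use terms in similar proportions; a score of 0 means they share no terms.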
Further Research: HITS + Relevance Scoring Method Three-Level Scoring Method (TLS) –Manual Evaluation of Relevance Relevant Links = 2 points Slightly Relevant Links = 1 point Inactive Links + Error Links (404, 603) = 0 points Irrelevant Links = 0 points –Order of query terms matters
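A minimal sketch of the point scheme above. The label names are hypothetical, and adding up the points over a result list is an assumption, since the slide does not say how the points are aggregated.

```python
# Points per manually assigned label, following the slide's three levels.
TLS_POINTS = {
    "relevant": 2,
    "slightly_relevant": 1,
    "inactive_or_error": 0,
    "irrelevant": 0,
}


def tls_score(judgments):
    """Add up the points for a list of manual relevance labels (assumed)."""
    return sum(TLS_POINTS[label] for label in judgments)


# Hypothetical judgments for the top five results of one query.
top_results = [
    "relevant",
    "slightly_relevant",
    "irrelevant",
    "relevant",
    "inactive_or_error",
]
score = tls_score(top_results)
```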
Further Research: Co-citation Graph [Figure: Regular Link Graph vs. Co-citation Graph]
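A minimal sketch of deriving a co-citation graph from a regular link graph, assuming the usual definition that two pages are co-cited when some third page links to both. The page names are hypothetical, and using the number of co-citing pages as the edge weight is an assumption.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(graph):
    """For each unordered page pair, count the pages that link to both."""
    counts = Counter()
    for targets in graph.values():
        # Deduplicate and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical regular link graph: page -> pages it links to.
link_graph = {
    "p1": ["x", "y"],
    "p2": ["x", "y", "z"],
    "p3": ["z"],
}
counts = cocitation_counts(link_graph)
```

Here "x" and "y" are co-cited twice (by "p1" and "p2"), so they get the strongest edge in the co-citation graph.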
What’s Next? Google’s New Search Index: Caffeine –Announced June 8th, 2010 –Up to 50% fresher results –Twice as fast Real Time Search –Twitter / Facebook caffeine.html
References
Chakrabarti, Soumen; Dom, Byron; Kumar, S. Ravi; Raghavan, Prabhakar; Rajagopalan, Sridhar; Tomkins, Andrew (1999). "Hypersearching the Web". Scientific American, June 1999.
Li, Longzhuang; Shang, Yi; Zhang, Wei (2002). "Improvement of HITS-based algorithms on web documents". Proceedings of the 11th International Conference on World Wide Web, May 7-11, 2002, Honolulu, Hawaii, USA.
Henzinger, M. (2001). "Hyperlink analysis for the Web". IEEE Internet Computing, 5(1).
Kleinberg, Jon (1999). "Authoritative sources in a hyperlinked environment" (PDF). Journal of the ACM 46 (5): 604–632.
von Ahn, Luis. "Hubs and Authorities" (PDF). Science of the Web Course Notes. Carnegie Mellon University.
Q & A