Hypersearching the Web, Chakrabarti, Soumen Presented By Ray Yamada.


1 Hypersearching the Web, Chakrabarti, Soumen Presented By Ray Yamada

2 Overview
–Why Do We Care?
–Purpose of the Paper
–Solution by the Clever Project
–Pros / Cons of the Paper
–Further Research

3 Why Do We Care?
Web link analysis is crucial for efficient crawling and ranking algorithms.
–Crawling: Google Sitemap submission, Yahoo Directory
–Ranking: relevant results

4 Purpose of the Paper?
To overcome these challenges of the Web:
–Its size and growth
–Its content types
–Language semantics
–New language
–Staleness of results
–Spam
–And more…

5 Solution: Hyperlinks, Hyperlinks, Hyperlinks… Can Think of the Web as a Directed Graph Node = Web page (URL) Edge = Hyperlink
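
The directed-graph view can be made concrete with a small sketch. The page names below are hypothetical placeholders, not taken from the paper; each key is a node (page) and each value is the set of pages it links to (outgoing edges). Inverting the mapping gives the incoming links, which is what link analysis mostly relies on.

```python
# Hypothetical toy web: page URL -> set of pages it links to (outgoing edges).
links = {
    "a.example": {"b.example", "c.example"},
    "b.example": {"c.example"},
    "c.example": set(),
}


def in_links(graph):
    """Invert the graph: map each page to the set of pages that link to it."""
    incoming = {page: set() for page in graph}
    for source, targets in graph.items():
        for target in targets:
            # setdefault covers targets that never appear as a key.
            incoming.setdefault(target, set()).add(source)
    return incoming


incoming = in_links(links)
```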

6 Solution: HITS Algorithm
Hyperlink-Induced Topic Search (HITS), a.k.a. Hubs and Authorities
Hubs: highly valued lists for a given query
–Ex. Yahoo Directory, Open Directory Project, and bookmarking sites
Authorities: highly endorsed answers to the query
–Ex. New York Times, Huffington Post, Twitter
A webpage can be both a hub and an authority
–Ex. restaurant review blogs

7 Solution: HITS Algorithm Cont…
Each page p is assigned two values, hub(p) and auth(p).
Initial value: for all p, hub(p) = 1 and auth(p) = 1 (or any predetermined number).
Authority update rule: for each page p, set auth(p) to the sum of the hub scores of all pages that point to it.
Hub update rule: for each page p, set hub(p) to the sum of the authority scores of all pages that it points to.
Normalize the scores and repeat.
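
A minimal Python sketch of this loop follows. It is an illustration under stated assumptions, not the Clever Project's implementation: the function names and the fixed iteration count are hypothetical, and scores are normalized to sum to 1 (the form used in the calculation slide), whereas Kleinberg's paper normalizes so that the squares sum to 1. Note that a hub's score is summed over the pages it points to.

```python
def normalize(scores):
    """Scale scores so they sum to 1 (left unchanged if all are zero)."""
    total = sum(scores.values())
    if total == 0:
        return dict(scores)
    return {page: value / total for page, value in scores.items()}


def hits(graph, iterations=50):
    """graph maps each page to an iterable of the pages it links to."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages pointing to p.
        new_auth = {page: 0.0 for page in pages}
        for source, targets in graph.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub update: sum the authority scores of the pages p points to.
        new_hub = {
            page: sum(new_auth[target] for target in graph.get(page, ()))
            for page in pages
        }
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical example: two list pages pointing at two answer pages.
toy = {"h1": ["a", "b"], "h2": ["a"], "a": [], "b": []}
hub, auth = hits(toy)
```

In this toy graph, "a" is endorsed by both list pages and so ends with a higher authority score than "b", while "h1" links to more authorities and ends with a higher hub score than "h2".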

8 Solution: HITS Algorithm Cont… Calculation

Hub(p)   Num of Links   Raw Score
0.249    3              0.747
0.321    4              1.284
0.181    2              0.362
0.123    2              0.246
0.088    2              0.176
0.015    1              -
0.018    2              0.036
0.003    1              -
Sum: 1.00               2.872

Authority Pages (q)   Raw Score   Auth(q)
SJ Merc News          0.57        0.198
Wall St. Journal      0.57        0.198
New York Times        0.874       0.304
USA Today             0.59        0.205
Facebook              0.123       0.043
Yahoo!                0.121       0.042
Amazon                0.024       0.008
Sum:                              1.000
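
The normalization step in the authority table on this slide can be checked with a few lines of Python. The raw scores are copied from the slide; dividing by their total and rounding to three places is an assumption about how the Auth(q) column was produced, though it reproduces the listed values.

```python
# Raw authority scores as listed on the slide.
raw_auth = {
    "SJ Merc News": 0.57,
    "Wall St. Journal": 0.57,
    "New York Times": 0.874,
    "USA Today": 0.59,
    "Facebook": 0.123,
    "Yahoo!": 0.121,
    "Amazon": 0.024,
}

# Normalize: divide each raw score by the total so the column sums to 1.
total = sum(raw_auth.values())
auth = {page: round(score / total, 3) for page, score in raw_auth.items()}

for page, score in auth.items():
    print(f"{page:<18}{score:.3f}")
```

The total of the raw scores is 2.872, the figure that appears in the sum row of the slide.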

9 Pros:
–Accurately addresses concerns and challenges we currently deal with
–Great introduction to search engine algorithms
–Briefly covers many topics (breadth)

10 Cons:
–Some material is out of date (1999)
  –Ex. Google vs. Clever Project
–Lack of depth
  –Ex. Normalization of Hub and Auth values

11 Further Research: HITS Algorithm – Extreme Cases
Large-in-small-out sites
–High Auth(p)
–No problem
Small-in-large-out sites
–High Hub(p)
–Problem

12 Further Research: HITS + Relevance Scoring Method
Vector Space Model (VSM)
–Documents and queries are represented by vectors
–Term frequency
Okapi Measurement
–Term frequency + document length
Cover Density Ranking (CDR)
–Phrase similarity (how close together the terms appear)

13 Further Research: HITS + Relevance Scoring Method
Use a cosine relevance test.
[Figure; labels: "Price", "Car"]
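
As a rough illustration of the cosine test in the vector space model, the sketch below represents a query and a document as raw term-frequency vectors and measures the cosine of the angle between them. The whitespace tokenizer, the function names, and the sample texts are assumptions made for the example; real systems typically add term weighting and length normalization.

```python
import math
from collections import Counter


def term_vector(text):
    """Term-frequency vector from naive lowercase whitespace tokenization."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two term-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = term_vector("car price")
same_direction = term_vector("car price car price")  # same term proportions
unrelated = term_vector("weather forecast")          # no shared terms

similar_score = cosine(query, same_direction)
unrelated_score = cosine(query, unrelated)
```

A document with the same proportions of "car" and "price" as the query scores close to 1 regardless of its length, while a document sharing no terms scores 0.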

14 Further Research: HITS + Relevance Scoring Method
Three-Level Scoring Method (TLS)
–Manual evaluation of relevance
  Relevant links = 2 points
  Slightly relevant links = 1 point
  Inactive links + error links (404, 603) = 0 points
  Irrelevant links = 0 points
–Order of query terms matters
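
The point scheme above can be tallied as in the sketch below. Only the point values come from the slide; the label strings and the function name are hypothetical, and the sketch does not model how the order of query terms enters the score.

```python
# Points per manually assigned label (values from the slide; label names assumed).
TLS_POINTS = {
    "relevant": 2,
    "slightly_relevant": 1,
    "inactive": 0,
    "error": 0,
    "irrelevant": 0,
}


def tls_score(labels):
    """Sum the points for a list of manually judged result links."""
    return sum(TLS_POINTS[label] for label in labels)


# Hypothetical judgments for the top results of one query.
judged = ["relevant", "slightly_relevant", "error", "irrelevant", "relevant"]
score = tls_score(judged)
```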

15 Further Research: Co-citation Graph
[Figures: regular link graph; co-citation graph]
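
Since the slide's figures are not reproduced here, the sketch below shows one common way to derive a co-citation graph from a regular link graph, under the standard definition that two pages are co-cited when a third page links to both. The page names and the choice to weight each pair by its number of co-citing pages are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def co_citation(graph):
    """Count, for each unordered pair of pages, how many pages link to both."""
    counts = Counter()
    for targets in graph.values():
        # Sorting gives each unordered pair one canonical (x, y) key.
        for pair in combinations(sorted(set(targets)), 2):
            counts[pair] += 1
    return counts


# Hypothetical link graph: p and q are citing pages; a, b, c are cited pages.
link_graph = {"p": ["a", "b"], "q": ["a", "b", "c"]}
cocited = co_citation(link_graph)
```

Here "a" and "b" are co-cited twice (by both "p" and "q"), so they get the heaviest edge in the co-citation graph.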

16 What’s Next?
Google’s New Search Index: Caffeine
–Announced June 8th, 2010
–Up to 50% fresher results
–Twice as fast
Real Time Search
–Twitter / Facebook
http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html

17 References
Chakrabarti, Soumen; Dom, Byron; Kumar, S. Ravi; Raghavan, Prabhakar; Rajagopalan, Sridhar & Tomkins, Andrew (1999). "Hypersearching the Web". Scientific American, June 1999.
Li, Longzhuang; Shang, Yi & Zhang, Wei (2002). "Improvement of HITS-based algorithms on web documents". Proceedings of the 11th International Conference on World Wide Web, May 07-11, 2002, Honolulu, Hawaii, USA. doi:10.1145/511446.511514.
Henzinger, M. (2001). "Hyperlink analysis for the Web". IEEE Internet Computing, 5(1), 45-50.
Kleinberg, Jon (1999). "Authoritative sources in a hyperlinked environment" (PDF). Journal of the ACM 46 (5): 604–632. doi:10.1145/324133.324140.
von Ahn, Luis (2008-10-19). "Hubs and Authorities" (PDF). 15-396: Science of the Web Course Notes. Carnegie Mellon University. Retrieved 2008-11-09.

18 Q & A

