Hypersearching the Web, Chakrabarti, Soumen Presented By Ray Yamada.

Overview
– Why Do We Care?
– Purpose of the Paper
– Solution by the Clever Project
– Pros / Cons of the Paper
– Further Research

Why Do We Care?
– Web link analysis is crucial for efficient crawling and ranking algorithms
– Crawling: Google Sitemap Submission, Yahoo Directory
– Ranking: relevant results

Purpose of the Paper?
To overcome these challenges of the Web:
– Its size and growth
– Its content types
– Language semantics
– New language
– Staleness of results
– Spam
– And more…

Solution: Hyperlinks, Hyperlinks, Hyperlinks…
– The Web can be thought of as a directed graph
– Node = Web page (URL)
– Edge = Hyperlink
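To make the directed-graph view concrete, here is a minimal sketch in Python. The page names and the helper in_links are hypothetical and chosen only for illustration; they do not come from the paper or the slides.

```python
# Hypothetical toy web: each key is a page (node), and each value is the
# set of pages it links to (outgoing edges).
links = {
    "a.example": {"b.example", "c.example"},
    "b.example": {"c.example"},
    "c.example": set(),
}


def in_links(graph):
    """Reverse every edge: map each page to the set of pages linking to it."""
    incoming = {page: set() for page in graph}
    for source, targets in graph.items():
        for target in targets:
            incoming.setdefault(target, set()).add(source)
    return incoming


incoming = in_links(links)
```

Link analysis needs both directions: the outgoing edges say what a page endorses, and the incoming edges say who endorses it.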

Solution: HITS Algorithm
Hyperlink-Induced Topic Search (HITS)
– A.k.a. Hubs and Authorities
Hubs: highly valued lists for a given query
– Ex. Yahoo Directory, Open Directory Project and bookmarking sites
Authorities: highly endorsed answers to the query
– Ex. New York Times, Huffington Post, Twitter
A webpage can be both a hub and an authority
– Ex. Restaurant review blogs

Solution: HITS Algorithm Cont…
– Each page p is assigned two values, hub(p) and auth(p)
– Initial value: for all p, hub(p) = 1 and auth(p) = 1 (or any predetermined number)
– Authority update rule: for each page p, update auth(p) to be the sum of the hub scores of all pages that point to it
– Hub update rule: for each page p, update hub(p) to be the sum of the authority scores of all pages that it points to
– Normalize and repeat
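The update rules can be sketched in Python as follows. This follows Kleinberg's formulation, in which a page's authority score sums the hub scores of the pages linking to it and its hub score sums the authority scores of the pages it links to. The slides do not say how to normalize, so dividing by the Euclidean norm is an assumption here, as are the function names, the fixed iteration count, and the toy graph.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (an assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(graph, iterations=20):
    """Run a fixed number of HITS rounds on {page: set of pages it links to}."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages pointing to each page.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub update: sum the authority scores of the pages each page points to.
        new_hub = {p: sum(new_auth[t] for t in graph.get(p, ())) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Hypothetical toy graph: two list-like pages pointing at two answer pages.
toy = {"h1": {"a1", "a2"}, "h2": {"a1"}}
hub, auth = hits(toy)
```

In the toy graph, a1 is linked from both list pages and so ends up with the highest authority score, while h1 links to both answer pages and ends up with the highest hub score.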

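Solution: HITS Algorithm Cont… (Calculation)
[Worked-example tables; the numeric values were not preserved in the transcript.]
– Hub table columns: Hub(p), Num of Links, Raw Score (with a Sum row)
– Authority table columns: Authority Pages (q), Raw Score, Auth(q) (with a Sum row)
– Pages listed: SJ Merc News, Wall St. Journal, New York Times, USA Today, Facebook, Yahoo!, Amazon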

Pros:
– Accurately addresses concerns and challenges we currently deal with
– Great introduction to search engine algorithms
– Briefly covers many topics (breadth)

Cons:
– Some material is out of date (1999)
  – Ex. Google vs. Clever Project
– Lack of depth
  – Ex. Normalization of hub and auth values

Further Research: HITS Algorithm – Extreme Cases
Large-in-small-out sites
– High Auth(p)
– No problem
Small-in-large-out sites
– High Hub(p)
– Problem

Further Research: HITS + Relevance Scoring Method
Vector Space Model (VSM)
– Documents and queries are represented by vectors
– Term frequency
Okapi Measurement
– Term frequency + document length
Cover Density Ranking (CDR)
– Phrase similarity (how close together the terms appear)

Further Research: HITS + Relevance Scoring Method
– Use the cosine relevance test
[Figure: axes labeled "Price" and "Car"]
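As an illustration of the cosine test in the vector space model, the sketch below represents a query and a document as raw term-frequency vectors and measures the cosine of the angle between them. Whitespace tokenization, raw counts as weights, and the function name are simplifying assumptions; the slides do not specify a weighting scheme.

```python
import math
from collections import Counter


def cosine_similarity(query, document):
    """Cosine of the angle between two term-frequency vectors."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    # A Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(q[term] * d[term] for term in q)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


same = cosine_similarity("car price", "car price")
partial = cosine_similarity("car price", "car car price")
unrelated = cosine_similarity("car price", "weather report")
```

A score near 1 means the document uses the query terms in similar proportions, and a score of 0 means the two share no terms.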

Further Research: HITS + Relevance Scoring Method
Three-Level Scoring Method (TLS)
– Manual evaluation of relevance
  – Relevant links = 2 points
  – Slightly relevant links = 1 point
  – Inactive links + error links (404, 603) = 0 points
  – Irrelevant links = 0 points
– Order of query terms matters
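The point scheme above can be tallied with a few lines of Python. This is only a sketch of the manual-evaluation step as the slide describes it: the label names and the function are hypothetical, and it does not model the query-term ordering that TLS also takes into account.

```python
# Points per manually assigned label, as listed on the slide.
# The label strings themselves are hypothetical names chosen for this sketch.
POINTS = {
    "relevant": 2,
    "slightly_relevant": 1,
    "irrelevant": 0,
    "inactive": 0,
    "error": 0,
}


def tally_points(labels):
    """Sum the points for a list of manually judged result links."""
    return sum(POINTS[label] for label in labels)


judged = ["relevant", "slightly_relevant", "error", "relevant", "irrelevant"]
total = tally_points(judged)
```

Comparing the totals for the same query across two rankings gives a rough measure of which ranking surfaced more relevant links.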

Further Research: Co-citation Graph
[Figure: regular link graph]
[Figure: co-citation graph]
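The figures did not survive in the transcript, so here is a small sketch of how a co-citation graph can be derived from a regular link graph. It assumes the usual definition: two pages are co-cited when a third page links to both, and the edge weight counts how many pages do so. The function name and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(graph):
    """Count, for each pair of pages, how many pages link to both of them."""
    counts = Counter()
    for targets in graph.values():
        # Sorting gives each unordered pair a single canonical key.
        for a, b in combinations(sorted(targets), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical link graph: p and q are citing pages; x, y and z are cited pages.
link_graph = {"p": {"x", "y"}, "q": {"x", "y", "z"}}
cocited = cocitation_counts(link_graph)
```

Here x and y are cited together by both p and q, so their co-citation edge has weight 2, while the pairs involving z have weight 1.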

What’s Next?
Google’s New Search Index: Caffeine
– Announced June 8th, 2010
– Up to 50% fresher results
– Twice as fast
Real-Time Search
– Twitter / Facebook
caffeine.html

References
– Chakrabarti, Soumen; Dom, Byron; Kumar, S. Ravi; Raghavan, Prabhakar; Rajagopalan, Sridhar & Tomkins, Andrew. (1999). "Hypersearching the Web" [Article]. Scientific American, June 1999.
– Li, Longzhuang; Shang, Yi & Zhang, Wei. (2002). "Improvement of HITS-based algorithms on web documents". Proceedings of the 11th International Conference on World Wide Web, May 07-11, 2002, Honolulu, Hawaii, USA.
– Henzinger, M. (2001). "Hyperlink analysis for the Web". IEEE Internet Computing, 5(1).
– Kleinberg, Jon. (1999). "Authoritative sources in a hyperlinked environment" (PDF). Journal of the ACM 46 (5): 604–632.
– von Ahn, Luis. "Hubs and Authorities" (PDF). Science of the Web Course Notes. Carnegie Mellon University.

Q & A