Finding related pages in the World Wide Web A review by: Liang Pan, Nick Hitchcock, Rob Elwell, Kirtan Patel and Lian Michelson.



Content
– Introduction
– Algorithms: Companion, Co-citation, Netscape's approach
– Evaluation
– Critique
– Conclusion

Introduction
Searching on the World Wide Web
– Common search tools include Google and Yahoo.
– Traditional approach: keyword-query based. You must specify your information need by giving relevant keywords, which is prone to errors!
– Question: what do I do if I don't know exactly what I am looking for?

Introduction
Another way: use a URL as the search input instead of a phrase of text.
What are the requirements?
– Fast
– High precision
– Little input data

Introduction
How does it work? It exploits the Web's graph structure. Two algorithms are proposed:
– Companion: derived from the HITS (Hyperlink-Induced Topic Search) algorithm proposed by Kleinberg for ranking search results. Makes use of edge weights and hub and authority scores.
– Co-citation: finds pages that are frequently co-cited with an input URL u.
[Diagram: sites A, B, C each link to u and also to X, Y, Z, so X, Y, Z are found as pages related to u]

Companion Algorithm
Takes a starting URL u as input. Made up of 4 steps:
1. Build the vicinity graph of u
2. Contract duplicates and near-duplicates in the graph
3. Compute edge weights based on host-to-host connections
4. Compute a hub score and an authority score for each node in the graph, and return the top-ranked authority nodes

Companion Algorithm
Uses 5 values* to help determine relevant pages:
– Go Back (B): how many parent sites the website has, i.e. going from u1 to p1
– Back-Forward (BF): how many child sites the parent has, i.e. going from u1 to p2 and then to u2 (or u1)
– Forward (F): how many children the site has (pages it links to), i.e. u1 to c1
– Forward-Back (FB): how many parent sites the children have, i.e. u1 to c1 to u3
– STOP list: websites considered not to be relevant to the page's content
*These values are determined before the algorithm is executed.
[Diagram: a web graph with parents p1, p2, websites u1, u2, u3 and children c1, c2 connected by hyperlinks]

Companion Algorithm
Step 1 – Building the vicinity graph of u
– If u is part of the STOP list then it is ignored; otherwise all other sites in the list are ignored.
[Diagram: vicinity graph after step 1, containing p1, p2, u2, u3, c1 and c2]
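The step-1 expansion can be sketched as follows. This is a minimal illustration, not the paper's implementation: `parents_of` and `children_of` are hypothetical lookup functions standing in for a connectivity server, and the default limits for B, BF, F and FB are placeholder values.

```python
def build_vicinity_graph(u, parents_of, children_of,
                         B=50, BF=50, F=50, FB=50, stop_list=frozenset()):
    """Collect the nodes of u's vicinity graph: parents (B), the
    parents' children (BF), children (F), and the children's parents
    (FB). Pages on the STOP list are excluded; if u itself is on the
    list, nothing is built."""
    if u in stop_list:
        return set()
    keep = lambda n: n not in stop_list
    nodes = {u}
    parents = [p for p in parents_of(u) if keep(p)][:B]    # Go Back
    nodes.update(parents)
    for p in parents:                                      # Back-Forward
        nodes.update(c for c in children_of(p)[:BF] if keep(c))
    children = [c for c in children_of(u) if keep(c)][:F]  # Forward
    nodes.update(children)
    for c in children:                                     # Forward-Back
        nodes.update(p for p in parents_of(c)[:FB] if keep(p))
    return nodes
```

For example, with the small graph from the previous slide (u's parents p1 and p2, sibling u2 reached through p1, child c1, and co-parent u3 reached through c1), the returned set contains all six nodes.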

Companion Algorithm

Step 2 – Eliminate any duplication
– If one of the nodes (websites) in the graph has 10 or more links and has 95% of its links in common with another node*, combine the links from both nodes (union) to create one node.
– This removes sites that are likely to be the same (e.g. mirror sites, or the same site under different names).
Step 3 – Assign edge weights
– If two nodes are on the same host, the edge between them is given weight zero.
– If there are k links going to one site (i.e. many-to-one), each edge's authority weight is set to 1/k.
– If there are L links from one site (i.e. one-to-many), each edge's hub weight is set to 1/L.
The vicinity graph of u has now been constructed!
*This clearly has its problems!
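The step-3 weighting rule above can be sketched as follows. This is one illustrative reading of the slide (same-host edges get weight zero, authority weight 1/k, hub weight 1/L); `host_of` is a hypothetical helper that extracts the host from a URL.

```python
from collections import defaultdict

def assign_edge_weights(edges, host_of):
    """Return {(src, dst): (authority_weight, hub_weight)} for a list
    of (src, dst) link pairs."""
    in_count = defaultdict(int)   # cross-host links into each node (k)
    out_count = defaultdict(int)  # cross-host links out of each node (L)
    for s, d in edges:
        if host_of(s) != host_of(d):
            in_count[d] += 1
            out_count[s] += 1
    weights = {}
    for s, d in edges:
        if host_of(s) == host_of(d):
            weights[(s, d)] = (0.0, 0.0)    # same host: edge carries no weight
        else:
            weights[(s, d)] = (1.0 / in_count[d],   # authority weight 1/k
                               1.0 / out_count[s])  # hub weight 1/L
    return weights
```

Down-weighting by 1/k and 1/L is what limits any single host's influence on the scores computed in step 4.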

Companion Algorithm
Step 4 – Compute hub and authority scores
– Nodes (websites) with a high authority score are expected to have relevant content.
– Nodes with a high hub score are expected to contain links to relevant content.
– The 10 highest-scoring authority nodes are then returned as pages related to the starting URL u.
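Step 4 is a weighted HITS-style iteration. A minimal sketch, assuming edge weights of the (authority_weight, hub_weight) form from step 3; the fixed iteration count is an assumption, not the paper's stopping rule.

```python
def hub_authority_scores(edges, weights, iterations=50):
    """Alternately update authority and hub scores over the weighted
    vicinity graph, normalising after each pass."""
    nodes = {n for edge in edges for n in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # authority of d: weighted sum of hub scores of pages linking to it
        new_auth = dict.fromkeys(nodes, 0.0)
        for (s, d) in edges:
            aw, _hw = weights[(s, d)]
            new_auth[d] += hub[s] * aw
        norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}
        # hub of s: weighted sum of authority scores of pages it links to
        new_hub = dict.fromkeys(nodes, 0.0)
        for (s, d) in edges:
            _aw, hw = weights[(s, d)]
            new_hub[s] += auth[d] * hw
        norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return hub, auth

def top_authorities(edges, weights, n=10):
    """Return the n highest-scoring authority nodes (the related pages)."""
    _hub, auth = hub_authority_scores(edges, weights)
    return sorted(auth, key=auth.get, reverse=True)[:n]
```

On a toy graph where two hubs both point at one page, that page ends up with the highest authority score, as expected.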

Co-citation Algorithm
– Two sites are co-cited if they have a common parent, e.g. u3 and u1 are co-cited by p1.
– The degree of co-citation (DoC) is the number of common parents two sites have, e.g. u3 and u1 have a DoC of 2.
– The algorithm finds the siblings of a site, computes their DoC, and returns the top 10 sites with the highest DoC.
– If the number of siblings of u < 15 and the DoC of u < 2, the algorithm restarts with a URL one level up from the original, e.g. if u = a.com/X/Y/Z then the new u = a.com/X/Y.
[Diagram: parents p1, p2 linking to u1, u2, u3; u2 and u3 are the siblings of u1]
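The procedure above can be sketched directly. `parents_of` and `children_of` are hypothetical link-lookup helpers; the restart-one-level-up rule is noted in the docstring but left to the caller.

```python
from collections import Counter

def cocitation_related(u, parents_of, children_of, top_n=10):
    """Rank u's siblings by degree of co-citation (number of shared
    parents). If u has fewer than 15 siblings and no sibling reaches a
    DoC of 2, the algorithm restarts with the URL one path level up
    (e.g. a.com/X/Y/Z -> a.com/X/Y); that retry is left to the caller."""
    doc = Counter()
    for p in parents_of(u):          # each parent of u ...
        for sibling in children_of(p):
            if sibling != u:         # ... co-cites u with its other children
                doc[sibling] += 1
    return [site for site, _count in doc.most_common(top_n)]
```

With the diagram's graph (p1 linking to u1 and u3; p2 linking to u1, u2 and u3), querying u1 ranks u3 (DoC 2) above u2 (DoC 1).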

Netscape's Approach
– The "What's Related" function. Not a lot of detail is given in the paper!
– Gets similar pages from web crawling, archiving, categorising and data mining (as opposed to just using the web graph like the previous algorithms).
– Also tries to learn from trends (i.e. what users click on after they search for a keyword).

Implementation
– Compaq's Connectivity Server provides 180 million URLs (nodes).
– A multi-threaded server takes in URLs and uses either the Companion or the Co-citation algorithm to find related pages.

Evaluation
– Studies were carried out to determine the performance of these algorithms.
– Benchmarked against Netscape's approach.
– Re-visit the initial requirements:
  – Speed
  – Precision
  – Little input data – already achieved

Evaluation
Speed
– 109 milliseconds for Companion, and 195 ms for Co-citation.
– The complexity of the Co-citation algorithm is in the order of O(n log n).
Precision
[Precision results chart not reproduced in this transcript]

Critique
– Faults within HITS are not investigated. Nomura, Satoshi and Hayamizu, "Analysis and Improvement of HITS Algorithm for Detecting Web Communities", show some of the problems with the algorithm.
– Requires the user to have already found something relevant to what they are looking for, i.e. "I have found NYTimes; I want to look at what alternatives are available."
– Can it handle the scale of the web today? It was tested with connectivity information for 180 million pages; the indexable web now stands at over 11 billion.
– Links to friends' web pages that are not relevant to the input URL will be taken into account; given the size of the web today, this may lead to bad results.
– A small, specialised population was used in testing, so the evaluation lacks generality.
– The 'two clicks away' idea no longer holds today.

Critique
Looking at the positives:
– The algorithms do indeed outperform Netscape's algorithm for finding related pages, and can be extended to handle more than one input URL*.
– Easy to implement.
– Many papers were consulted and used during the process of writing and implementing the work.
*At the time (1999).

Applications and Future Work
– Data mining / web structure mining: finding authoritative web pages.
– Classifying web documents: exploring co-cited material. If two pages are linked they may be related, and if one is pointed to, it may be important.
– Extend the algorithm to improve the heuristic and look beyond the 'two clicks away' idea.
– There is a lack of further work because the underlying assumption is so unrealistic by today's standards.

Conclusion
– Suggested a solution to the problem of searching for a topic that cannot be easily expressed as a simple text query.
– The Companion and Co-citation algorithms are fast ways of doing a search that differs from traditional text queries.
– The solution can be easily adapted and implemented in web servers.

Q & A Any Questions?

References
Hyperlink structure of the Web:
– G.O. Arocena, A.O. Mendelzon and G.A. Mihaila, "Applications of a Web Query Language", in: Proc. of the Sixth International World Wide Web Conference.
– S. Chakrabarti et al., "Enhanced Hypertext Categorisation Using Hyperlinks", in which links and their order are used to categorise web pages.
– E. Spertus, "ParaSite: Mining Structural Information on the Web", which also suggested using co-citation and other forms of connectivity to identify related web pages.
– J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment". The HITS algorithm is used as a starting point for the Companion algorithm, which extends and modifies it.
– P. Calado, M. Cristo, M.A. Gonçalves, E.S. de Moura, B. Ribeiro-Neto and N. Ziviani, "Linkage Similarity Measures for the Classification of Web Documents".
– S.K. Madria, "Web Mining – A Bird's Eye View" (presentation).