HITS Hypertext Induced Topic Selection

Gyozo Gidofalvi, Uppsala Database Laboratory
2019-02-04

Idea
Given a set of web pages that are all concerned with the same topic, we want to find, by examining the internal link structure of the set:
- the most interesting pages
- the pages that are most likely to guide us to interesting pages

Foundation
Identify hubs and authorities. The definition is mutually recursive:
- A good hub points to good authorities.
- A good authority is pointed to by good hubs.
The hub value of a site is the sum of the authority values of the sites that it points to.
The authority value of a site is the sum of the hub values of the sites that point to it.
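The two closing sentences of this slide can be written as update rules. The notation here is an assumption for illustration and does not come from the slides: h(p) and a(p) are the hub and authority values of page p, E is the set of links, and (p, q) in E means that p links to q. In matrix form, A is the adjacency matrix with A_{pq} = 1 exactly when p links to q.

```latex
a(p) = \sum_{q \,:\, (q,p) \in E} h(q),
\qquad
h(p) = \sum_{q \,:\, (p,q) \in E} a(q)

% Equivalent matrix form, with column vectors a and h:
\mathbf{a} = A^{\mathsf{T}} \mathbf{h},
\qquad
\mathbf{h} = A \, \mathbf{a}
```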

Pseudo-code
1. Find a set of pages about a given subject.
   - You may use an existing search engine (such as Google).
   - In the assignment, you are provided with a set of pages and their links.
2. Preprocess the link structure.
3. Initialize the hub and authority vectors.
4. Normalize the vectors to length 1.
5. Calculate the new authority vector based on the link structure and the hub vector.
6. Calculate the new hub vector based on the link structure and the authority vector.
7. If the new values of the hub and authority vectors are similar enough to the old ones, we are done; otherwise repeat from step 4.
8. Sort the vectors and find the top authorities and hubs.
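The iterative part of the pseudo-code (initialize, normalize, update, compare, sort) could look roughly like the Python sketch below. It is one possible reading, not the reference solution for the assignment. The names `hits` and `normalize`, the initial value 1.0, the tolerance, the choice to compute the hub vector from the freshly updated authority vector, and the example graph are all assumptions made for illustration.

```python
import math


def normalize(vec):
    """Scale a dict of scores to Euclidean length 1 (unchanged if all zero)."""
    length = math.sqrt(sum(v * v for v in vec.values()))
    if length == 0:
        return dict(vec)
    return {page: v / length for page, v in vec.items()}


def hits(links, tol=1e-9, max_iter=1000):
    """Return (hub, authority) score dicts.

    `links` maps each page to the collection of pages it points to.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)

    # Initialize both vectors (1.0 is an assumed choice), then normalize.
    hub = normalize({p: 1.0 for p in pages})
    auth = normalize({p: 1.0 for p in pages})

    for _ in range(max_iter):
        # New authority value: sum of hub values of the pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # New hub value: sum of authority values of the pages linked to.
        # Assumption: the freshly computed authority vector is used here.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, ())) for p in pages}

        # Normalize to length 1 before comparing with the previous round.
        new_auth, new_hub = normalize(new_auth), normalize(new_hub)
        change = max(
            max(abs(new_auth[p] - auth[p]), abs(new_hub[p] - hub[p]))
            for p in pages
        )
        auth, hub = new_auth, new_hub
        if change < tol:  # "similar enough" to the old vectors
            break

    return hub, auth


# Hypothetical link structure (not the test case from the slides).
example_links = {"p1": {"p3", "p4"}, "p2": {"p3"}, "p3": set(), "p4": set()}
hub, auth = hits(example_links)

# Sort to find the top authorities and hubs.
top_authorities = sorted(auth, key=auth.get, reverse=True)
top_hubs = sorted(hub, key=hub.get, reverse=True)
print("authorities:", top_authorities)
print("hubs:", top_hubs)
```

Storing the vectors as dictionaries keyed by page keeps the sketch free of index bookkeeping; for large link structures a sparse matrix representation would probably be a better fit.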

Calculating the hub and authority vectors
First, we initialize the hub and authority vectors to some value.
- What initial values are appropriate?
- Does it matter what we initialize to?
Next, we calculate the new hub and authority vectors using the formulas for the hub and authority values.
- Does it matter in which order these calculations happen?
- Do we need to normalize the vectors in each iteration?
- How do we know when to stop?

Preprocessing
Preprocessing will improve the accuracy.
- Several links may point to the same page:
    http://www.it.uu.se
    http://www.it.uu.se/index.html
    www.it.uu.se
- Remove site-internal links, as they can make a site seem more important than it really is.
- Remove links to sites for which we do not know the link structure.
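The three preprocessing rules on this slide might be sketched in Python as follows. This is a simplified illustration, not the procedure from the lab course web page. The helper names `canonical_url`, `site_of`, and `preprocess` are hypothetical, a "site" is assumed to mean the host name, a missing scheme is assumed to be `http`, and "unknown link structure" is assumed to mean that the target never appears as a source in the input. Ports, query strings, and fragments are dropped for brevity.

```python
from urllib.parse import urlsplit


def canonical_url(url):
    """Map common spellings of the same page to one key (assumed rules)."""
    if "://" not in url:
        url = "http://" + url  # assumption: a missing scheme means http
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]  # treat /index.html like /
    path = path.rstrip("/")
    return f"{parts.scheme}://{host}{path}"


def site_of(url):
    """Assumption: two pages belong to the same site if the host name matches."""
    return urlsplit(canonical_url(url)).hostname


def preprocess(links):
    """Clean a dict that maps each page to the pages it links to."""
    cleaned = {}
    for src, targets in links.items():
        src_c = canonical_url(src)
        out = cleaned.setdefault(src_c, set())
        for dst in targets:
            dst_c = canonical_url(dst)
            if site_of(dst_c) == site_of(src_c):
                continue  # drop site-internal links
            out.add(dst_c)
    # Drop links to pages whose own outgoing links are unknown.
    known = set(cleaned)
    return {src: {d for d in out if d in known} for src, out in cleaned.items()}


# Hypothetical input; only the www.it.uu.se spellings come from the slide.
raw_links = {
    "http://www.it.uu.se": [
        "http://www.it.uu.se/index.html",
        "http://example.org/",
        "http://unknown.example.net/page",
    ],
    "example.org": ["www.it.uu.se"],
}
clean_links = preprocess(raw_links)
print(clean_links)
```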

The assignment
You will mine four different link structures for four different queries.
We have done the web crawling and some of the preprocessing for you! Input files are on the lab course web page.
However, you must:
- Do some preprocessing yourselves.
  - Directions for preprocessing are on the lab course web page.
- Validate your implementation.
  - Think of how to verify your solution.
  - Your validation does not have to be fancy, or even automated.
  - At least implement the test case on the following slide and see what output it gives you.
  - Make sure that the test case output is reasonable.

Example (test case)
Rank the pages according to hub and authority value in this link structure:
[Figure: link structure over the pages a, b, c, d; the links are not recoverable from this transcript]