Link Structure and Web Mining Shuying Wang 2003.11.


Link Structure and Web Mining Shuying Wang

Outline
Part one: Link Structure and Web Mining
Part two: Analysis of Link Structure
Topics covered:
- Web mining methods
- Text-based Web mining
- Web graph: bow-tie theory
- Eigenvalues and eigenvectors
- Authorities and hubs
- HITS (Hyperlink-Induced Topic Search)
- PageRank

Challenges for Web Search
The WWW is a vast collection of information: over 3 billion text pages plus a multitude of multimedia files. Over a million new resources are added every day.
- Huge
- Complex
- Dynamic
- Diverse
- Different user groups
How do we find the information we need in such a large collection? Search is among the most common activities on the web.

Web Mining Methods
- Web content mining: context, keywords, document classification
- Web structure mining: link structure and link text
- Web usage mining: web logs, URLs, timestamps, IP addresses, and web page content

Limitations of text-based analysis
[Figure: web pages, web database, keyword, text-based ranking function]
- E.g., for the query “harvard”, a text-based ranking function may fail to recognize the Harvard home page as one of the most authoritative pages, since many other web pages contain “harvard” more often.
- Pages are not sufficiently self-descriptive. The term “search engine” usually doesn't appear on search engine web pages.

Bow-tie Theory

What are the benefits of link building?
- Following a link is one of the most popular ways for people to find new sites.
- By linking to other material, authors don't have to re-invent the wheel.
- Inbound links help to build trust.
- Link structure and link text provide a lot of information for making relevance judgments and for quality filtering.
- The link structure implies an underlying social structure in the way that pages and links are created, and an understanding of this social organization can give us the most leverage.

Queries and Authoritative Sources
Types of queries:
- Specific queries, e.g., “Does Netscape support the JDK 1.3?”
- Broad-topic queries, e.g., “Find information about the Java programming language.”
- Similar-page queries, e.g., “Find pages similar to java.sun.com.”
Authoritative pages, relative to a broad-topic query:
- It is not sufficient to collect a large number of potentially relevant pages with text-based methods.
- Authorities are often not particularly self-descriptive.

Authorities and Hubs
- A good authority is a page that is pointed to by many good hubs, while a good hub is a page that points to many good authorities. This is a mutually reinforcing relationship.
- Authorities are the pages that contain the most definitive, central, and useful information in the context of a particular topic.
- Hubs are pages that link to a collection of prominent sites on a common topic.
[Figure: hubs pointing to authorities]

HITS (Hyperlink-Induced Topic Search)
- The focused subgraph is created by first taking the highest-ranked pages from a text-based search engine as a root set R.
- R is expanded into the base set S by adding every page that points to, or is pointed to by, a page in R.
- Note that while R may fail to contain some “important” authorities, S will probably contain them.
[Figure: the root set R inside the larger base set S]
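The expansion from root set to base set can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the web is given as a dictionary `links` mapping each page to the pages it links to, the names `expand_root_set`, `links`, and the page labels are hypothetical, and no cap is placed on how many neighbouring pages are added.

```python
def expand_root_set(root, links):
    """Return the base set S: the root set R plus every page that
    links to, or is linked from, a page in R."""
    root = set(root)
    base = set(root)
    for source, targets in links.items():
        for target in targets:
            if source in root:
                base.add(target)  # page pointed to by a root page
            if target in root:
                base.add(source)  # page pointing to a root page
    return base


# Hypothetical toy web: page -> pages it links to.
links = {
    "r1": ["s1"],
    "r2": ["s1", "s2"],
    "s3": ["r1"],
    "x": ["y"],  # unrelated pages, should stay outside S
}
base = expand_root_set(["r1", "r2"], links)
print(sorted(base))
```

Pages `x` and `y` have no link to or from the root set, so they are left out of the base set.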

Computing Hubs and Authorities (1)
With each page p we associate a non-negative authority weight $a_p$ and a non-negative hub weight $h_p$:
$a_p = \sum_{q:\, q \to p} h_q$ (1)
$h_p = \sum_{q:\, p \to q} a_q$ (2)
Number the pages {1, 2, …, n} and define their adjacency matrix A to be the n × n matrix whose (i, j)th entry is 1 if page i links to page j, and 0 otherwise. Define $a = (a_1, a_2, \ldots, a_n)$ and $h = (h_1, h_2, \ldots, h_n)$. Then
$a = A^T h$ (3)
$h = A a$ (4)
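A single round of the two updates in matrix form can be checked by hand on a tiny graph. The sketch below assumes the updates take the usual HITS form $a = A^T h$ followed by $h = A a$, with A defined as on this slide. The 3-page graph and the helper names `transpose` and `mat_vec` are illustrative only.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


# Hypothetical graph: page 0 links to pages 1 and 2, page 1 links to page 2.
# A[i][j] == 1 if page i links to page j.
A = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]

h = [1, 1, 1]
a = mat_vec(transpose(A), h)  # a = A^T h: add up hub weights of in-linking pages
h = mat_vec(A, a)             # h = A a: add up authority weights of linked-to pages
print(a, h)
```

Page 2 has two in-links and gets the largest authority weight. Page 0 links to both authorities and gets the largest hub weight.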

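Computing Hubs and Authorities (2)
Combining the two updates $a = A^T h$ and $h = A a$ gives, up to the normalization constant,
$a = A^T A\, a$ (5)
$h = A A^T h$ (6)
Let
$B = A^T A$ (7)
In other words, a is an eigenvector of B.
- B is the co-citation matrix: B(i, j) is the number of sites that jointly point to both i and j.
- B is symmetric and has n orthogonal unit eigenvectors.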
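To make the co-citation reading concrete, the sketch below builds B entry by entry as $B_{ij} = \sum_k A_{ki} A_{kj}$. This assumes $B = A^T A$, with A the adjacency matrix defined earlier (entry (i, j) is 1 if page i links to page j). The 4-page graph and the function name `cocitation` are hypothetical.

```python
def cocitation(A):
    """Return B = A^T A, where B[i][j] counts the pages that link
    to both page i and page j."""
    n = len(A)
    return [
        [sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical graph: pages 0 and 1 each link to pages 2 and 3; page 2 links to page 3.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
B = cocitation(A)
for row in B:
    print(row)
```

Pages 2 and 3 are both pointed to by pages 0 and 1, so B[2][3] is 2. The diagonal entry B[i][i] is the in-degree of page i, and B equals its own transpose.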

Computing Hubs and Authorities (3)
- We initialize a(p) = h(p) = 1 for all p.
- We iterate the following operations: $a(p) \leftarrow \sum_{q:\, q \to p} h(q)$ and $h(p) \leftarrow \sum_{q:\, p \to q} a(q)$.
- We renormalize after each iteration.
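The iteration described above can be sketched as follows. This is a minimal illustration rather than a reference implementation. It assumes the standard HITS updates (authority weight is the sum of hub weights of in-linking pages, hub weight is the sum of authority weights of linked-to pages) and normalizes to unit Euclidean length. The fixed count of 50 iterations is an arbitrary choice, and the names `hits`, `normalize`, and the page labels are hypothetical.

```python
import math


def normalize(weights):
    """Scale the weights so that their squares sum to 1."""
    norm = math.sqrt(sum(x * x for x in weights.values())) or 1.0
    return {page: x / norm for page, x in weights.items()}


def hits(links, iterations=50):
    """Return (authority, hub) weight dictionaries for a link graph
    given as page -> list of pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub weights of pages linking to each page.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # Hub update: sum the new authority weights of the pages each page links to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub


# Hypothetical toy graph: three hub-like pages pointing at two authority-like pages.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
auth, hub = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

On this graph `a1` collects links from all three hubs and ends up with the largest authority weight, while `h1` and `h2` score above `h3` as hubs because they point to both authorities.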

Computing Hubs and Authorities (4)
- The eigenvectors of B are precisely the stationary points of this process.
- a is the principal eigenvector of $A^T A$, and h is the principal eigenvector of $A A^T$.
- The principal eigenvector represents the “densest cluster” within the focused subgraph.
- By initializing a(p) = h(p) = 1, a will converge to the principal eigenvector of B.
- Initializing differently may lead to convergence to a different eigenvector.
- In practice, convergence is achieved after only a small number of iterations.

PageRank (simple structure of the Google search engine)
[Figure: offline, TextIndex() builds an inverted text index and PageRank() computes page ranks from the Web; at query time, the query processor uses both to turn a query into ranked results.]

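PageRank Computing
$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$, with $c < 1$ (1)
where
- u: a web page
- v: a page that links to u
- $B_u$: the set of pages that link to u
- $N_v$: the number of links out of v
- c: a normalization factor
Let A be a square matrix with rows and columns corresponding to web pages, with $A_{u,v} = 1/N_v$ if there is a link from v to u and $A_{u,v} = 0$ otherwise. If we treat R as a vector over web pages, then
$R = cAR$ (2)
so R is an eigenvector of A with eigenvalue 1/c.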
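The simplified formula can be tried out with a short iteration. The sketch below assumes $R(u) = c \sum_{v \in B_u} R(v)/N_v$, with $N_v$ taken to be the number of out-links of v and c applied as a renormalization that keeps the ranks summing to 1. It deliberately omits any damping factor or rank source, since the slide gives only the simplified form. The function name `simple_pagerank` and the three-page graph are hypothetical.

```python
def simple_pagerank(links, iterations=100):
    """Iterate R(u) = c * sum over v in B_u of R(v) / N_v on a graph
    given as page -> list of pages it links to."""
    pages = sorted(set(links) | {u for targets in links.values() for u in targets})
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            if not targets:
                continue  # a page without out-links passes on no rank
            share = rank[v] / len(targets)  # R(v) / N_v
            for u in targets:
                new_rank[u] += share
        # The factor c acts as a normalization: rescale so the ranks sum to 1.
        total = sum(new_rank.values()) or 1.0
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


# Hypothetical toy graph: a links to b and c, b links to c, c links back to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = simple_pagerank(links)
print({page: round(r, 3) for page, r in rank.items()})
```

On this graph the ranks settle near 0.4 for `a`, 0.2 for `b`, and 0.4 for `c`: page `b` receives only half of `a`'s rank, while `c` receives the other half plus all of `b`'s.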

HITS and PageRank
PageRank:
- Computed offline
- Focuses on authoritative pages
- Computed over all web pages
HITS:
- Computed at query time
- Seeks good hub pages in addition to authorities
- Computed over the pages of the base set

Conclusion
- Link analysis gives a technique for locating high-quality information related to a broad search topic on the WWW, based on a structural analysis of the link topology surrounding “authoritative” pages on the topic.
- Related work: standing and influence in social networks, scientific citations, hypertext and WWW rankings, …

References
- Jon Kleinberg. Mining the Link Structure of the World Wide Web.
- Jon Kleinberg. Authoritative Sources in a Hyperlinked Environment.
- Larry Page. The PageRank Citation Ranking: Bringing Order to the Web.
- Jingyu Hou, Yanchun Zhang. Effectively Finding Relevant Web Pages from Linkage Information.
- Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques.