1 PageSim: A Link-based Measure of Web Page Similarity Research Group Presentation Allen Z. Lin, 8 Mar 2006.

Presentation transcript:

1 PageSim: A Link-based Measure of Web Page Similarity Research Group Presentation Allen Z. Lin, 8 Mar 2006

2 Outline
  What & Why?
  Existing approaches
  PageSim: a new approach
  Demonstrations
  Conclusion and current work

3 What & Why?
  Ranking similarity between web pages.
  Applications on the Web
    – Finding related, or similar, web pages to a page. Example: Google’s “Similar pages”.
    – Web page classification. Example: YAHOO!’s Web Directory, with its hierarchical structure.
  Key question: How to measure the similarity?

4 Existing approaches
  Text-based
    – Using common features of two web pages: Jaccard’s coefficient, Adamic/Adar.
  Link-based
    – Using the neighbors of two web pages: Common neighbor, Co-citation, SimRank.
    – Using paths between two web pages: Katz index, Hitting time.
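
As a small illustration of the text-based family, the sketch below computes Jaccard's coefficient over two feature sets. The feature sets (bags of terms for two hypothetical pages) are assumptions made up for this example and do not come from the slides.

```python
def jaccard(features_a, features_b):
    """Jaccard's coefficient: shared features divided by all distinct features."""
    a, b = set(features_a), set(features_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty feature sets
    return len(a & b) / len(a | b)


# Hypothetical term sets for two pages (illustrative only).
page_a = {"web", "search", "ranking", "link"}
page_b = {"web", "search", "similarity"}

score = jaccard(page_a, page_b)
print(score)  # 2 shared terms out of 5 distinct terms -> 0.4
```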

5 Existing approaches (cont.)
  Notations
    – Sim(a,b): similarity score of web pages a and b.
    – I(a): in-link neighbors of web page a.
    – O(a): out-link neighbors of web page a.
  Common neighbor method
    – Sim(a,b) = |O(a)∩O(b)| = |{c,d}| = 2
  Cocitation method
    – Sim(a,b) = |I(a)∩I(b)| = |{c,d}| = 2
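
Both counting measures reduce to a set intersection. The sketch below uses a small hypothetical link graph (the figure from the slide is not available, so the edges here are assumptions) and follows the definitions Sim(a,b) = |O(a)∩O(b)| and Sim(a,b) = |I(a)∩I(b)|.

```python
# Hypothetical out-link sets O(u); pages a and b both link to c and d.
out_links = {
    "a": {"c", "d", "e"},
    "b": {"c", "d"},
    "c": set(),
    "d": set(),
    "e": set(),
}


def invert(out_links):
    """Build the in-link sets I(u) from the out-link sets O(u)."""
    in_links = {u: set() for u in out_links}
    for u, targets in out_links.items():
        for v in targets:
            in_links[v].add(u)
    return in_links


def common_neighbor(out_links, u, v):
    """Sim(u,v) = |O(u) ∩ O(v)|: number of pages both u and v link to."""
    return len(out_links[u] & out_links[v])


def cocitation(in_links, u, v):
    """Sim(u,v) = |I(u) ∩ I(v)|: number of pages that link to both u and v."""
    return len(in_links[u] & in_links[v])


in_links = invert(out_links)
print(common_neighbor(out_links, "a", "b"))  # a and b share out-neighbors c, d -> 2
print(cocitation(in_links, "c", "d"))        # c and d are both cited by a, b -> 2
print(cocitation(in_links, "c", "e"))        # only a cites both c and e -> 1
```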

6 Existing approaches (cont.)
  SimRank
    – Two pages are similar if they are referenced (cited, or linked to) by similar pages.
    – Base cases: 1. Sim(u,u) = 1; 2. Sim(u,v) = 0 if |I(u)|·|I(v)| = 0.
    – Recursive definition: Sim(u,v) = C / (|I(u)|·|I(v)|) · Σ_i Σ_j Sim(I_i(u), I_j(v)), where I_i(u) is the i-th in-link neighbor of u.
    – C is a constant between 0 and 1.
    – The iteration starts with Sim(u,u) = 1 and Sim(u,v) = 0 if u ≠ v.
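
A minimal sketch of the SimRank iteration, assuming the standard update from Jeh and Widom: the score of a pair is C times the average previous score over all pairs of their in-link neighbors. The three-page graph, the value C = 0.8, and the iteration count are assumptions chosen for illustration.

```python
def simrank(in_links, C=0.8, iterations=10):
    """Naive iterative SimRank over a dict mapping each page to its in-link neighbors."""
    nodes = list(in_links)
    # Start with Sim(u,u) = 1 and Sim(u,v) = 0 for u != v.
    sim = {u: {v: 1.0 if u == v else 0.0 for v in nodes} for u in nodes}
    for _ in range(iterations):
        new = {u: {} for u in nodes}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[u][v] = 1.0
                elif not in_links[u] or not in_links[v]:
                    # No in-links on either side: similarity is defined as 0.
                    new[u][v] = 0.0
                else:
                    total = sum(sim[x][y] for x in in_links[u] for y in in_links[v])
                    new[u][v] = C * total / (len(in_links[u]) * len(in_links[v]))
        sim = new
    return sim


# Hypothetical graph: page c links to both a and b.
in_links = {"a": ["c"], "b": ["c"], "c": []}
scores = simrank(in_links)
print(scores["a"]["b"])  # both are cited only by c, so the score is C * Sim(c,c) = 0.8
print(scores["a"]["c"])  # c has no in-links -> 0.0
```

Note that a pair gets a nonzero score only when both pages have in-links, which is the limitation the PageSim slides go on to discuss.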

7 PageSim: a new approach
  Two problems
    – On the Web, not all links are equally important. (Problem for: Common neighbor, Cocitation)
    – A similarity measure should be able to measure the similarity between any two web pages. (Problem for: SimRank)
  PageSim
    – Takes the above problems into account.

8 PageSim: a new approach (cont.)
  Cocitation
  Which page is more similar to d, c or e?
  Suppose page a is YAHOO!’s homepage, and b is a personal web page.
  Authoritative pages are more important.

9 PageSim: a new approach (cont.)
  SimRank
  Are a and b similar?
    – SimRank says “NO”s.
  Are the answers reasonable?

10 PageSim: a new approach (cont.)
  Page a linking to b and c means a “thinks”
    – b and c are kind of similar.
    – both b and c are kind of similar to a too.
  Page a spreads similarity to its neighbors.
  Authoritative pages spread more similarity.

11 PageSim: a new approach (cont.)
  PageSim
    – In PageSim, the PageRank (PR) score is used to measure the authority of a web page. PR assigns global importance scores to all web pages.
    – Each page spreads its own similarity score (PR score) to its neighbors.
    – Each page also propagates other pages’ similarity scores to its neighbors.
    – After the similarity score propagation finishes, each page contains an array of similarity scores.
    – PageRank score propagation
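
The slides take PR scores as given. For orientation, here is a minimal PageRank power-iteration sketch. Several things are assumptions rather than details from the slides: the damping factor 0.85 (the conventional choice), the requirement that every page has at least one out-link, the rescaling so that PR(a) = 100, and the edge list a→b, a→c, b→c, c→a, which was picked because it appears to reproduce the PR values used in the three-page example in this deck.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power iteration; assumes every page has at least one out-link."""
    nodes = list(out_links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            share = damping * pr[u] / len(out_links[u])
            for v in out_links[u]:
                new[v] += share
        pr = new
    return pr


# Assumed edge list (not stated on the slides).
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = pagerank(out_links)

# Rescale so that PR(a) = 100, an assumed normalization.
scaled = {u: 100.0 * pr[u] / pr["a"] for u in pr}
print({u: round(s) for u, s in scaled.items()})  # {'a': 100, 'b': 55, 'c': 102}
```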

12 PageSim: a new approach (cont.)
  Example: similarity propagation (page a only)
    – PR(a)=100, PR(b)=55, PR(c)=102
    – Each page propagates 80% of its similarity score, divided evenly among its neighbors.
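
The propagation rule can be sketched as a walk over simple paths: at every hop a page passes on 80% of the score it holds, split evenly over its out-links, and a score never re-enters a page it has already visited. The edge list a→b, a→c, b→c, c→a and the no-revisit rule are assumptions; they were chosen because, after rounding, they reproduce the similarity score vectors reported for this example. The function name `propagate` is hypothetical.

```python
from collections import defaultdict


def propagate(out_links, pr, decay=0.8):
    """Return sv where sv[v][u] is the share of u's PR score that reaches page v."""
    sv = {v: defaultdict(float) for v in out_links}
    for source in out_links:
        sv[source][source] = pr[source]  # a page keeps its own full score
        # Depth-first walk over simple paths that start at `source`.
        stack = [(source, pr[source], {source})]
        while stack:
            node, score, visited = stack.pop()
            targets = out_links[node]
            if not targets:
                continue
            share = decay * score / len(targets)
            for nxt in targets:
                if nxt in visited:
                    continue  # do not propagate a score back along a cycle
                sv[nxt][source] += share
                stack.append((nxt, share, visited | {nxt}))
    return sv


# Assumed edge list (not stated on the slides) and the PR scores from the example.
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = {"a": 100, "b": 55, "c": 102}
sv = propagate(out_links, pr)

for page in "abc":
    print(page, [round(sv[page][src]) for src in "abc"])
# a [100, 35, 82]
# b [40, 55, 33]
# c [72, 44, 102]
```

For page a alone: 40 goes to b and 40 to c directly, and b forwards 80% of its 40 (that is, 32) to c, so c ends up holding 72 of a's score.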

13 PageSim: a new approach (cont.)
  Example: similarity propagation (cont.)
    – PR(a)=100, PR(b)=55, PR(c)=102
    – Each page contains a similarity score vector (SV).
        SV(a) = (100, 35, 82),
        SV(b) = ( 40, 55, 33),
        SV(c) = ( 72, 44, 102)
    – PageSim score (PS) computation
        PS(a,b) = Σ min( SV(a), SV(b) ) = 40 + 35 + 33 = 108
    – Two pages are more similar if they share more common similarity scores.
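
The PS computation is an element-wise minimum followed by a sum. A minimal sketch using the score vectors from this slide (the function name `pagesim` is hypothetical):

```python
def pagesim(sv_u, sv_v):
    """PS(u,v): sum of the element-wise minimum of two similarity score vectors."""
    return sum(min(x, y) for x, y in zip(sv_u, sv_v))


# Similarity score vectors from the slide, ordered as (a, b, c).
SV = {
    "a": (100, 35, 82),
    "b": (40, 55, 33),
    "c": (72, 44, 102),
}

print(pagesim(SV["a"], SV["b"]))  # min(100,40) + min(35,55) + min(82,33) = 108
```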

14 PageSim: a new approach (cont.)
  Example: similarity spreading (cont.)
    – PageSim score matrix
        PS_matrix = (PS(u,v))_{n×n}, lower triangle:
          a: 217
          b: 108 128
          c: 189 117 218
    – PS_matrix is symmetric.
        PS(a,b) = PS(b,a)
    – Any web page is most similar to itself.
        PS(u,u) = max( PS(u,v) ), for any v.
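
As a check on the two stated properties, the sketch below builds the full PageSim matrix from the score vectors of the three-page example (SV(a) = (100, 35, 82), SV(b) = (40, 55, 33), SV(c) = (72, 44, 102)) and verifies symmetry and the maximal diagonal. The variable names are hypothetical.

```python
# Similarity score vectors from the three-page example, ordered as (a, b, c).
SV = {
    "a": (100, 35, 82),
    "b": (40, 55, 33),
    "c": (72, 44, 102),
}
pages = sorted(SV)

# PS(u,v) = sum of element-wise minima of SV(u) and SV(v).
ps = {
    u: {v: sum(min(x, y) for x, y in zip(SV[u], SV[v])) for v in pages}
    for u in pages
}

for u in pages:
    print(u, [ps[u][v] for v in pages])
# a [217, 108, 189]
# b [108, 128, 117]
# c [189, 117, 218]

symmetric = all(ps[u][v] == ps[v][u] for u in pages for v in pages)
diagonal_is_max = all(ps[u][u] == max(ps[u].values()) for u in pages)
print(symmetric, diagonal_is_max)  # True True
```

Both properties follow directly from the definition: min is symmetric in its arguments, and min(x, y) can never exceed x, so no row entry can exceed the diagonal.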

15 Demonstrations
  Example 1: single link
    – PageSim matrix [rows a to d; only the entry a: 100 is recoverable]
    – PR = (100, 185, 257.2, 318.6)
    – SimRank matrix [entries not recoverable]

16 Demonstrations (cont.)
  Example 2: loop link
    – PageSim matrix [rows a to d; entries not recoverable]
    – PR = (100, 100, 100, 100)
    – SimRank matrix [entries not recoverable]

17 Demonstrations (cont.)
  Example 3: more complex
    – PageSim matrix [rows 1 to 5; entries not recoverable]
    – PR = (100, 40.0, 50.7, 10.7, 10.7)
    – SimRank matrix [rows 1 to 5; recoverable rows: 1: 1; 2: 0 1]
    – PageSim results
        v3 is most similar to v1.
        v4 is most similar to v2.

18 Conclusion and current work
  Conclusion
    – Web page similarity measures: Text-based & Link-based
    – PageSim: PageRank score propagation.
  Current work
    – Propagation radius pruning.
    – How to compare performance of two similarity measures, e.g., PageSim and SimRank? Text-based measures.
  Thank you!