1 Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure Allen, Zhenjiang LIN CSE, CUHK 13 Dec 2006.

Presentation transcript:

1 Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure Allen, Zhenjiang LIN CSE, CUHK 13 Dec 2006

2 Outline 1. Introduction 2. Extended Neighborhood Structure Model 3. Extending Link-based Similarity Measures 4. Experimental Results 5. Conclusion and Future Work

3 1. Introduction Background • Similarity measures are required in many web applications to evaluate the similarity between web pages: the “similar pages” service of Web search engines; Web document classification; Web community identification. Problem • Many link-based similarity measures are not very accurate because they consider only part of the structural information.

4 1. Introduction Motivation • How can the accuracy of link-based similarity measures be improved by making full use of the structural information? Contributions • Propose the Extended Neighborhood Structure (ENS) model: bi-directional and multi-hop. • Construct extended link-based similarity measures based on the ENS model: more flexible and accurate.

5 1. Introduction Searching the Web • Keyword searching • Similarity searching [Figure: a search engine queried with KEYWORDS (e.g. “news”) next to a search engine queried with a URL, which uses a similarity measure]

6 1. Introduction Similarity measures • Evaluate how similar or related two objects are. Approaches to measuring similarity • Text-based: Cosine TFIDF [Joachims97] • Link-based (focus of this talk): Bibliographic coupling [Kessler63], Co-citation [Small73], SimRank [Jeh et al 02], PageSim [Lin et al 06] • Hybrid

7 2. Extended Neighborhood Structure Model Extended Neighborhood Structure (ENS) model • Question: what is hidden in hyperlinks? A similarity relationship between pages; this similarity relationship weakens along hyperlinks.

8 Extended Neighborhood Structure (ENS) model • The ENS model is bi-directional (in-link and out-link) and multi-hop (direct: 1-hop; indirect: 2-hop, 3-hop, etc.). • Purpose: improve the accuracy of link-based similarity measures by helping them make full use of the structural information of the Web.
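The bi-directional, multi-hop neighborhood described on this slide can be illustrated with a breadth-first search run once over out-links and once over in-links. This is only a sketch: the function names, the dictionary-of-sets graph representation, and the toy graph are assumptions made for illustration and do not come from the talk.

```python
from collections import deque

def reverse_graph(out_links):
    """Build the in-link adjacency from an out-link adjacency (dict of sets)."""
    in_links = {}
    for src, targets in out_links.items():
        for dst in targets:
            in_links.setdefault(dst, set()).add(src)
    return in_links

def k_hop_neighbors(adj, start, max_hops):
    """Return {page: hop distance} for pages reachable within max_hops of start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if dist[page] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adj.get(page, ()):
            if nxt not in dist:
                dist[nxt] = dist[page] + 1
                queue.append(nxt)
    del dist[start]  # the page itself is not its own neighbor
    return dist

# Hypothetical toy web graph: e -> a -> b -> c -> d
out_links = {"a": {"b"}, "b": {"c"}, "c": {"d"}, "e": {"a"}}
in_links = reverse_graph(out_links)

out_nbrs = k_hop_neighbors(out_links, "a", 2)  # out-link direction, up to 2 hops
in_nbrs = k_hop_neighbors(in_links, "a", 2)    # in-link direction, up to 2 hops
```

Here `out_nbrs` holds the 1-hop and 2-hop out-link neighbors of page `a`, and `in_nbrs` holds its in-link neighbors; a measure built on the ENS model would draw on both.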

9 3. Extending Link-based Similarity Measures Intuition of similarity • Similar web pages have similar neighbors. (To compare two web pages, look at their neighbors.) Notations • G = (V, E), |V| = n: the web graph. • I(a) / O(a): in-link / out-link neighbors of web page a. • path(a_1, a_s): a sequence of vertices a_1, a_2, …, a_s such that (a_i, a_{i+1}) ∈ E (i = 1, …, s-1) and the a_i are distinct. • PATH(a,b): the set of all possible paths from page a to page b. • Sim(a,b): similarity score of web pages a and b.
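As a small illustration of the notation, the set PATH(a,b) can be enumerated with a depth-first search that never revisits a vertex. The function name `simple_paths` and the toy graph are hypothetical; the number of such paths can grow exponentially, so this sketch is only suitable for tiny graphs.

```python
def simple_paths(out_links, a, b):
    """Yield every path from a to b whose vertices are all distinct."""
    def extend(path, visited):
        last = path[-1]
        if last == b:
            yield list(path)
            return
        # sorted() only makes the enumeration order deterministic
        for nxt in sorted(out_links.get(last, ())):
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                yield from extend(path, visited)
                path.pop()
                visited.remove(nxt)
    yield from extend([a], {a})

# Hypothetical graph: a -> b, a -> c, b -> c, c -> d
graph = {"a": {"b", "c"}, "b": {"c"}, "c": {"d"}}
paths_a_to_d = list(simple_paths(graph, "a", "d"))
```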

10 3. Extending Link-based Similarity Measures Two classical methods • Co-citation: the more common in-link neighbors, the more similar. Sim(a,b) = |I(a) ∩ I(b)| • Bibliographic coupling: the more common out-link neighbors, the more similar. Sim(a,b) = |O(a) ∩ O(b)| Extended Co-citation and Bibliographic Coupling (ECBC) • ECBC: the more common neighbors, the more similar. Sim(a,b) = α|I(a) ∩ I(b)| + (1 - α)|O(a) ∩ O(b)|, where 0 ≤ α ≤ 1 is a constant.
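The three formulas on this slide translate directly into set intersections. The sketch below assumes the graph is stored as dictionaries mapping each page to a set of neighbors; the function names, the example pages, and the choice α = 0.5 are illustrative assumptions, not values from the talk.

```python
def co_citation(in_links, a, b):
    """Sim(a,b) = |I(a) ∩ I(b)|: number of shared in-link neighbors."""
    return len(in_links.get(a, set()) & in_links.get(b, set()))

def bibliographic_coupling(out_links, a, b):
    """Sim(a,b) = |O(a) ∩ O(b)|: number of shared out-link neighbors."""
    return len(out_links.get(a, set()) & out_links.get(b, set()))

def ecbc(in_links, out_links, a, b, alpha=0.5):
    """Sim(a,b) = alpha*|I(a) ∩ I(b)| + (1 - alpha)*|O(a) ∩ O(b)|."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (alpha * co_citation(in_links, a, b)
            + (1.0 - alpha) * bibliographic_coupling(out_links, a, b))

# Hypothetical neighborhoods of pages a and b
in_links = {"a": {"p", "q"}, "b": {"q", "r"}}
out_links = {"a": {"x", "y"}, "b": {"x", "y", "z"}}

cc = co_citation(in_links, "a", "b")                # shared in-links: q
bc = bibliographic_coupling(out_links, "a", "b")    # shared out-links: x, y
both = ecbc(in_links, out_links, "a", "b", alpha=0.5)
```

Setting `alpha=1.0` reduces ECBC to Co-citation and `alpha=0.0` reduces it to Bibliographic coupling, which is why the extended measure can only use more structural information than either classical method.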

11 3. Extending Link-based Similarity Measures SimRank: “two pages are similar if they are linked to by similar pages” • (1) Sim(u,u) = 1; (2) Sim(u,v) = 0 if |I(u)| |I(v)| = 0. Recursive definition • C is a constant between 0 and 1. • The iteration starts with Sim(u,u) = 1, Sim(u,v) = 0 if u ≠ v.
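The equation on this slide did not survive in the transcript. The standard SimRank recurrence from [Jeh et al 02] is Sim(u,v) = C / (|I(u)| |I(v)|) · Σ_{x ∈ I(u)} Σ_{y ∈ I(v)} Sim(x,y) for u ≠ v, and the sketch below iterates it naively from the stated starting values. The function name, the value C = 0.8, the iteration count, and the toy graph are assumptions for illustration only.

```python
from itertools import product

def simrank(in_links, nodes, C=0.8, iterations=10):
    """Naive SimRank iteration over all ordered node pairs."""
    # Start: Sim(u,u) = 1 and Sim(u,v) = 0 for u != v
    sim = {(u, v): 1.0 if u == v else 0.0
           for u, v in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for u, v in product(nodes, repeat=2):
            if u == v:
                new_sim[(u, v)] = 1.0
                continue
            in_u = in_links.get(u, ())
            in_v = in_links.get(v, ())
            if not in_u or not in_v:
                # Sim(u,v) = 0 if |I(u)| |I(v)| = 0
                new_sim[(u, v)] = 0.0
                continue
            total = sum(sim[(x, y)] for x in in_u for y in in_v)
            new_sim[(u, v)] = C * total / (len(in_u) * len(in_v))
        sim = new_sim
    return sim

# Hypothetical graph: page p links to both a and b
nodes = ["p", "a", "b"]
in_links = {"a": {"p"}, "b": {"p"}}
scores = simrank(in_links, nodes)
```

In this toy graph `a` and `b` share their only in-link neighbor `p`, so their score settles at C, while `p` has no in-links and therefore scores 0 against every other page.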

12 3. Extending Link-based Similarity Measures Extended SimRank: “two pages are similar if they have similar neighbors” • (1) Sim(u,u) = 1; (2) Sim(u,v) = 0 if |I(u)| |I(v)| = 0. Recursive definition • C is a constant between 0 and 1. • The iteration starts with Sim(u,u) = 1, Sim(u,v) = 0 if u ≠ v.

13 3. Extending Link-based Similarity Measures PageSim: a “weighted multi-hop” version of the Co-citation algorithm. It considers • (a) multi-hop in-link information, and • (b) the importance of web pages, which can be represented by any global scoring system: • PageRank scores, or • authority scores of HITS.

14 3. Extending Link-based Similarity Measures PageSim (phase 1: feature propagation) • Initially, each web page holds a unique piece of feature information, which is represented by its PageRank score. • The feature information of a web page is propagated along out-link hyperlinks at decay rate d. The PR score of u propagated to v is defined by

15 3. Extending Link-based Similarity Measures PageSim (phase 2: similarity computation) • A web page v stores its own feature information and that of other pages in its Feature Vector FV(v). • The similarity between web pages u and v is computed by the Jaccard measure [Jain et al 88]. • Intuition: the more feature information two web pages have in common, the more similar they are.
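The exact PageSim formula is not preserved in this transcript. As a rough sketch of the intuition, the code below applies the common generalized (weighted) Jaccard measure, sum of element-wise minima over sum of element-wise maxima, to two feature vectors stored as dictionaries. Treating this as the form of the Jaccard measure used here is an assumption, and the function name and example scores are hypothetical.

```python
def jaccard_similarity(fv_u, fv_v):
    """Weighted Jaccard: sum of minima divided by sum of maxima.

    fv_u and fv_v map a source page to the (non-negative) amount of that
    page's feature information stored at u and v; missing entries count as 0.
    """
    keys = set(fv_u) | set(fv_v)
    shared = sum(min(fv_u.get(k, 0.0), fv_v.get(k, 0.0)) for k in keys)
    total = sum(max(fv_u.get(k, 0.0), fv_v.get(k, 0.0)) for k in keys)
    return shared / total if total > 0 else 0.0

# Hypothetical feature vectors: only page p's feature reaches both u and v
fv_u = {"p": 0.4, "q": 0.2}
fv_v = {"p": 0.2, "r": 0.2}
score = jaccard_similarity(fv_u, fv_v)
```

Two pages with identical feature vectors score 1, and two pages that share no feature information score 0, matching the intuition on the slide.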

16 3. Extending Link-based Similarity Measures Extended PageSim (EPS) • Propagate the feature information of web pages along in-link hyperlinks at decay rate 1 - d. • Compute the in-link PS scores. • EPS(u,v) = in-link PS(u,v) + out-link PS(u,v).

17 3. Extending Link-based Similarity Measures Properties CC: Co-citation, BC: Bibliographic Coupling, ECBC: Extended Co-citation and Bibliographic Coupling, SR: SimRank, ESR: Extended SimRank, PS: PageSim, EPS: Extended PageSim. • Summary: The extended versions consider more structural information. ESR and EPS are bi-directional and multi-hop. In ESR, two web pages are not similar unless there are intermediate pages between them, even if they link to each other (see Figure 1(2)).

18 3. Extending Link-based Similarity Measures Case study: Sim(a,b) • Summary: The extended algorithms are more flexible. EPS is able to handle more cases.

19 4. Experimental Results Datasets • CSE Web (CW) dataset: A set of web pages crawled from 22,000 pages, 180,000 hyperlinks. The average numbers of in-links and out-links are 8.6 and 7.7. • Google Scholar (GS) dataset: A set of articles crawled from the Google Scholar search engine. Crawling starts by submitting the keywords “web mining” to GS and then follows the “Cited by” hyperlinks. 20,000 articles, 154,000 citations.

20 4. Experimental Results Evaluation Methods • Cosine TFIDF similarity (for the CW dataset): a commonly used text-based similarity measure. • “Related Articles” (for the GS dataset): a list of articles related to a query article, provided by GS; it can be used as ground truth. Parameter Settings

21 4. Experimental Results CC, BC vs ECBC • CW data (left): x-axis: top N results; y-axis: average cosine TFIDF of all pages. • GS data (right): x-axis: top N results; y-axis: average precision of all pages.

22 4. Experimental Results SimRank vs Extended SimRank • CW data (left): x-axis: top N results; y-axis: average cosine TFIDF of all pages. • GS data (right): x-axis: top N results; y-axis: average precision of all pages.

23 4. Experimental Results PageSim vs Extended PageSim • CW data (left): x-axis: top N results; y-axis: average cosine TFIDF of all pages. • GS data (right): x-axis: top N results; y-axis: average precision of all pages.
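The two y-axis quantities used in these comparisons can be sketched as follows. The sketch assumes TF-IDF vectors are already computed and stored as term-to-weight dictionaries, and that precision is the fraction of the top N returned results found in the ground-truth “Related Articles” list; the function names and sample data are hypothetical, and the talk's exact evaluation protocol may differ.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked items that appear in the relevant set."""
    top = ranked[:n]
    if not top:
        return 0.0
    return sum(1 for item in top if item in relevant) / len(top)

# Hypothetical query result: four ranked articles, two of them relevant
ranked = ["x", "y", "z", "w"]
relevant = {"x", "z"}
p_at_2 = precision_at_n(ranked, relevant, 2)

# Hypothetical TF-IDF vectors for two pages
page_1 = {"web": 1.0, "mining": 1.0}
page_2 = {"web": 1.0}
text_sim = cosine(page_1, page_2)
```

Averaging `cosine` between each query page and its top N results, or `precision_at_n` over all query articles, gives one point on the corresponding curve for a given N.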

24 4. Experimental Results Overall Accuracy of Algorithms

25 5. Conclusion and Future Work Conclusion • Extended Neighborhood Structure model: bi-directional and multi-hop • Extended existing link-based similarity measures: Co-citation, Bibliographic coupling, SimRank, PageSim • Experiments Future Work • Extend link-based algorithms based on the ENS model • Prove the convergence of Extended SimRank • Integrate link-based with text-based measures

26 Publications • Z. Lin, M. R. Lyu, and I. King. PageSim: A novel link-based measure of web page similarity. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, Edinburgh, Scotland. • Z. Lin, I. King, and M. R. Lyu. PageSim: A novel link-based similarity measure for the World Wide Web. In WI '06: Proceedings of the 5th International Conference on Web Intelligence. ACM Press. To appear. • Z. Lin, M. R. Lyu, and I. King. Extending Link-based Algorithms for Similar Web Pages with Neighborhood Structure. Submitted to WWW '07.