Predictive Ranking - Handling missing data on the web. Haixuan Yang, Group Meeting, November 04, 2004.

Presentation transcript:

1 Predictive Ranking - Handling missing data on the web
Haixuan Yang, Group Meeting, November 04, 2004

2 Outline
– Introduction
– Related Work
– Predictive Ranking Model
– Block Predictive Ranking Model
– Experiment Setup

3 Introduction
PageRank (1998)
– It uses link information to rank web pages.
– The importance of a page depends on the number of pages that point to it.
– The importance of a page also depends on the importance of the pages that point to it.
– If X is the rank vector, [equation not preserved].
Problems
– Manipulation
– The "richer-get-richer" phenomenon
– Computation efficiency
– Dangling nodes problem

4 Introduction
Nodes that either have no out-link or for which no out-link is known are called dangling nodes.
Dangling nodes problem
– It is hard to sample the entire web.
    Page et al. (1998) reported that 51 million URLs had not yet been downloaded when 24 million pages had been downloaded.
    Handschuh et al. (2003) estimated that dynamic pages are 100 times more numerous than static pages.
    Eiron et al. (2004) reported that the number of uncrawled pages (3.75 billion) still far exceeds the number of crawled pages (1.1 billion).
– Including dangling nodes in the overall ranking may change not only the rank values of non-dangling nodes but also their order.

5 An example
If we ignore the dangling node 3, then the ranks for nodes 1 and 2 are [values not preserved].
If we consider the dangling node 3, then the ranks given by the revised PageRank algorithm (Kamvar 2003) are [values not preserved].
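
The graph and rank values of this example did not survive in the transcript. The sketch below therefore uses a hypothetical three-node graph (links 1→2, 2→1 and 1→3, with node 3 dangling), a damping factor of 0.85, and the uniform-jump treatment of dangling nodes that these slides attribute to Kamvar (2003). The function name `pagerank` and all parameter values are assumptions made for illustration only.

```python
import numpy as np

def pagerank(adj, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration on a row-normalised link matrix.

    adj[i][j] = 1 if node i links to node j. A row without out-links
    (a dangling node) is replaced by a uniform jump to all nodes.
    alpha is an assumed damping factor, not a value from the slides.
    """
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    P = np.empty_like(adj)
    for i in range(n):
        # Normalise each row; dangling rows become uniform.
        P[i] = adj[i] / out_deg[i] if out_deg[i] > 0 else np.full(n, 1.0 / n)
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        x_new = alpha * (P.T @ x) + (1 - alpha) / n
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

# Hypothetical graph: 1 -> 2, 2 -> 1, 1 -> 3; node 3 has no out-link.
ignored = pagerank([[0, 1], [1, 0]])                       # node 3 removed
included = pagerank([[0, 1, 1], [1, 0, 0], [0, 0, 0]])     # node 3 kept
print("node 3 ignored: ", ignored)
print("node 3 included:", included[:2] / included[:2].sum())
```

In this hypothetical graph nodes 1 and 2 tie when node 3 is ignored, while node 1 ranks above node 2 once node 3 is included, which is the kind of change in order the slide warns about.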

6 Introduction
Classes of dangling nodes
– Nodes that have been found but not yet visited at the current time are called dangling nodes of class 1.
– Nodes that have been tried but not visited successfully are called dangling nodes of class 2.
– Nodes that have been visited successfully but from which no out-link is found are called dangling nodes of class 3.
Different kinds of dangling nodes are handled in different ways. Our work focuses on dangling nodes of class 1, which cause missing information.

7 Illustration of dangling nodes
At time 1: visited node: 1; dangling nodes of class 1: 2, 4, 5, 7.
At time 2: visited nodes: 1, 7, 2; dangling nodes of class 3: 7; dangling nodes of class 1: 3, 4, 5, 6.
Known information at time 2: red links. Missing information at time 2: white links.
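
The three classes can be sketched as a small classifier over crawl state. The node lists below follow the time-2 state on this slide, but the link sets in `out_links` are assumed, since the figure itself is not preserved. The names `NodeClass` and `classify` are hypothetical.

```python
from enum import Enum

class NodeClass(Enum):
    NON_DANGLING = 0  # visited successfully, at least one out-link
    CLASS_1 = 1       # found but not yet visited
    CLASS_2 = 2       # visit attempted but unsuccessful
    CLASS_3 = 3       # visited successfully, no out-link found

def classify(found, visited_ok, visit_failed, out_links):
    """Label every found node according to the three dangling classes."""
    labels = {}
    for node in found:
        if node in visited_ok:
            has_links = bool(out_links.get(node))
            labels[node] = NodeClass.NON_DANGLING if has_links else NodeClass.CLASS_3
        elif node in visit_failed:
            labels[node] = NodeClass.CLASS_2
        else:
            labels[node] = NodeClass.CLASS_1
    return labels

# Time-2 state from the illustration; the out-link sets are assumptions.
labels = classify(
    found={1, 2, 3, 4, 5, 6, 7},
    visited_ok={1, 7, 2},
    visit_failed=set(),
    out_links={1: {2, 4, 5, 7}, 2: {3, 6}, 7: set()},
)
for node in sorted(labels):
    print(node, labels[node].name)
```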

8 Related work
– Page (1998): Simply remove the dangling nodes. After doing so, they can be added back in. The details are missing.
– Amati (2003): Handle dangling nodes robustly based on a modified graph.
– Kamvar (2003): Add a uniform random jump from dangling nodes to all nodes.
– Eiron (2004): Speed up the model in Kamvar (2003), but at the cost of accuracy. Furthermore, suggest an algorithm that penalizes the nodes that link to dangling nodes of class 2.

9 Related work - Amati (2003)

10 Related work - Kamvar (2003)

11 Related work - Eiron (2004)

12 Predictive Ranking Model
– For dangling nodes of class 3, we use the same technique as Kamvar (2003).
– For dangling nodes of class 2, we ignore them in the current model, although it is possible to combine the push-back algorithm (Eiron 2004) with our model. (Penalizing nodes is a subjective matter.)
– For dangling nodes of class 1, we try to predict the missing information about the link structure.

13 Predictive Ranking Model
Suppose that all the nodes V can be partitioned into three subsets:
– C, the set of all nodes that have been crawled successfully and have at least one out-link;
– the set of all dangling nodes of class 3;
– the set of all dangling nodes of class 1.
For each node v in V, the real in-degree of v is not known.
For each node v in C, the real out-degree of v is known.
For each dangling node v of class 3, the real out-degree of v is known to be zero.
For each dangling node v of class 1, the real out-degree of v is unknown.

14 Predictive Ranking Model
We predict the real in-degree of v from the number of found links from C to v.
– Assumption: the number of found links from C to v is proportional to the real number of links from V to v.
For example, if C and the class 3 dangling nodes together have 100 nodes, V has 1000 nodes, and the number of links from C to v is 5, then we estimate that the number of links from V to v is 50.
The difference between these two numbers is distributed uniformly over the dangling nodes of class 1.
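
The arithmetic of this example can be written out as a short sketch. It assumes the scaling factor is the ratio of total nodes to visited nodes, which reproduces the 5 → 50 estimate above, and that the unvisited nodes are exactly the remaining `n_total - n_visited` nodes. The function names are hypothetical.

```python
def predict_in_degree(found_links, n_visited, n_total):
    """Scale the observed in-degree by n_total / n_visited (assumed factor)."""
    if not 0 < n_visited <= n_total:
        raise ValueError("need 0 < n_visited <= n_total")
    return found_links * n_total / n_visited

def missing_links_per_unvisited(found_links, n_visited, n_total):
    """Spread the predicted-but-unseen links uniformly over unvisited nodes."""
    predicted = predict_in_degree(found_links, n_visited, n_total)
    n_unvisited = n_total - n_visited
    if n_unvisited == 0:
        return 0.0  # everything was visited, nothing is missing
    return (predicted - found_links) / n_unvisited

# Numbers from the slide: 100 visited nodes, 1000 nodes in V, 5 found links.
print(predict_in_degree(5, 100, 1000))            # 50.0
print(missing_links_per_unvisited(5, 100, 1000))  # 45 missing links over 900 nodes
```

Under these assumptions, each of the 900 unvisited nodes is credited with 45/900 = 0.05 of a link to v.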

15 Predictive Ranking Model
– Model the missing information from unvisited nodes to nodes in V.
– Model the known link information as in Page (1998): from C to V.
– Model the user's behavior as in Kamvar (2003) when facing dangling nodes of class 3.

16 Predictive Ranking Model
Model the users' behavior (called "teleportation") as in Page (1998) and Kamvar (2003): when users get bored of following actual links, they may jump to some node at random.
[symbol not preserved] is the rank vector.

17 Block Predictive Ranking Model
– Predict the in-degree of v more accurately.
– Divide all nodes into p blocks (V[1], V[2], …, V[p]) according to their top-level domains (for example, edu), their domains (for example, stanford.edu), or their countries (for example, cn).
– Assumption: the number of found links from C[i] (C[i] is the intersection of C and V[i]) to v is proportional to the real number of links from V[i] to v. Consequently, the matrix A is changed.
– The other parts are the same as in the Predictive Ranking Model.
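
The block-wise assumption can be sketched as a per-block version of the in-degree estimate. This is only an illustration: the block names, the counts, and the choice to scale each block by (nodes in the block) / (visited nodes in the block) are assumptions, and the function name is hypothetical.

```python
def predict_in_degree_by_block(found, visited, total):
    """Sum per-block estimates found[b] * total[b] / visited[b].

    found[b]:   links to v found from visited nodes in block b
    visited[b]: number of visited nodes in block b
    total[b]:   number of nodes in block b
    """
    estimate = 0.0
    for block, links in found.items():
        if visited[block] == 0:
            continue  # nothing visited in this block, so nothing to scale
        estimate += links * total[block] / visited[block]
    return estimate

# Hypothetical counts for two blocks.
found = {"edu": 4, "cn": 1}
visited = {"edu": 50, "cn": 50}
total = {"edu": 100, "cn": 900}

blockwise = predict_in_degree_by_block(found, visited, total)
single = sum(found.values()) * sum(total.values()) / sum(visited.values())
print(blockwise, single)  # 26.0 with blocks, 50.0 without
```

With these made-up counts most found links come from a well-crawled block, so the block-wise estimate (26) is lower than the single-ratio estimate (50).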

18 Block Predictive Ranking Model
Model the missing information from unvisited nodes in the 1st block to nodes in V.

19 Experiment Setup
– Get two datasets (by Patrick Lau): one is within the domain cuhk.edu.hk, the other is outside this domain. In the first dataset, we snapshot 11 matrices during the process of crawling; in the second dataset, we snapshot 9 matrices.
– Apply both the Predictive Ranking Model and the revised PageRank Model (Kamvar 2003) to these snapshots.
– Compare the results of both models at time t with the future results of both models.
    The future results rank more nodes than the current results, so it is difficult to make a direct comparison.

20 Illustration for comparison
Figure labels: future result, current result. Steps: Cut, Normalize, Comparison by 1-norm.
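
One reading of the three steps is sketched below, assuming that "Cut" means restricting the future result to the nodes already ranked in the current result and that "Normalize" means rescaling both vectors to sum to 1. The function name and the sample scores are hypothetical.

```python
import numpy as np

def compare_rankings(current, future):
    """Cut, normalise, and compare two rank results by the 1-norm.

    current, future: dicts mapping node -> rank score. The future result
    is assumed to rank every node of the current result plus extra nodes.
    """
    nodes = sorted(current)
    cur = np.array([current[n] for n in nodes], dtype=float)
    fut = np.array([future[n] for n in nodes], dtype=float)  # cut
    cur /= cur.sum()                                          # normalise
    fut /= fut.sum()
    return float(np.abs(cur - fut).sum())                     # 1-norm distance

# Hypothetical scores: the future crawl has found an extra node "c".
current = {"a": 0.5, "b": 0.5}
future = {"a": 0.6, "b": 0.2, "c": 0.2}
print(compare_rankings(current, future))
```

A distance of 0 would mean that the current result already agrees with the future result on the nodes they share.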

21 Within domain cuhk.edu.hk
Data description (table columns: Time t, Vnum[t], Tnum[t]; values not preserved).
Number of iterations (table columns: Time t, Predictive Ranking, Page Ranking; values not preserved).

22 Within domain cuhk.edu.hk Comparison Based on future PageRank result at time 11.

23 Within domain cuhk.edu.hk Comparison Based on future PreRank result at time 11

24 Outside cuhk.edu.hk
Data description (table columns: Time t, Vnum[t], Tnum[t]; values not preserved).
Number of iterations (table columns: Time t, Predictive Ranking, Page Ranking; values not preserved).

25 Outside cuhk.edu.hk Comparison Based on future PageRank result at time 9

26 Outside cuhk.edu.hk Comparison Based on future PreRank result at time 9

27 Q & A