Nadav Eiron, Kevin S. McCurley, John A. Tomlin, IBM Almaden Research Center, WWW’04. CSE 450 Web Mining. Presented by Zaihan Yang.


Introduction & Contribution
The paper proposes algorithmic innovations for the basic PageRank paradigm.
• The web frontier problem (dangling nodes)
  • Distinguish different types of dangling nodes.
  • Propose four techniques for handling penalty pages.
• The problems of computing PageRank and of rank manipulation
  • Explore the hierarchical structure of the web.
  • HostRank and DirRank algorithms.

PageRank
• Built on backlinks, the random-surfer model, and recursive computation.
• Ideal model: the web graph should be strongly connected, and the transition matrix A should be stochastic (with the chain irreducible and aperiodic).
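A common way to state the ideal model is sketched below. The notation is assumed here and is not taken from the slides: x_i is the rank of page i, d_j is the number of outlinks of page j, and A is the row-stochastic transition matrix of the random surfer.

```latex
x_i = \sum_{j \to i} \frac{x_j}{d_j}
\qquad\text{or, in matrix form,}\qquad
x^{\mathsf T} = x^{\mathsf T} A,
\quad
a_{ji} =
\begin{cases}
  1/d_j & \text{if page } j \text{ links to page } i, \\
  0     & \text{otherwise.}
\end{cases}
```

Under this reading, x is the stationary distribution of the random walk, which exists and is unique when the chain is irreducible and aperiodic.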

PageRank: Improved Model
• Add a link from each page to every page and give each such link a small transition probability controlled by a parameter α.
• This is the random jump (teleportation); it can be modeled with a virtual node n+1.
• Variations and issues:
  • the parameter α
  • the random jump follows a uniform distribution
  • dangling nodes
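The teleportation model can be illustrated with a short power iteration. This is a minimal sketch under stated assumptions, not the presenters' implementation: `alpha` is taken here to be the probability of following a link (the slides may use the opposite convention), jumps are uniform over all pages, and a dangling page jumps with probability 1. The function name and the toy graph are hypothetical.

```python
def pagerank(out_links, alpha=0.85, iters=100):
    """Power iteration for PageRank with uniform teleportation.

    out_links maps each page to the list of pages it links to.
    Assumption: alpha is the probability of following a link and
    1 - alpha the probability of a random jump.
    """
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: 0.0 for u in nodes}
        jump = 0.0  # total rank that teleports in this step
        for u in nodes:
            targets = out_links.get(u, [])
            if targets:
                share = alpha * rank[u] / len(targets)
                for v in targets:
                    nxt[v] += share
                jump += (1.0 - alpha) * rank[u]
            else:
                # Dangling page: all of its rank takes the random jump.
                jump += rank[u]
        for u in nodes:
            nxt[u] += jump / n
        rank = nxt
    return rank


# Toy graph (hypothetical): "c" collects links from every other page.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

Collecting the jump mass in one variable plays the role of the virtual node n+1: every page sends rank to it, and it spreads that rank uniformly.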

Dangling Nodes
Dangling nodes are nodes that either have no outlinks or for which no outlinks are known.
How pages become dangling nodes:
• Crawlers might not have crawled them (e.g. dynamic pages).
• They are protected by robots.txt.
• They genuinely have no outlinks (PS or PDF files).
• A meta tag indicates that their links should not be followed.
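In a crawled link graph, the definition above reduces to a simple set computation. The sketch below assumes the graph is stored as a dictionary from page to list of known outlinks; the function name and sample data are hypothetical.

```python
def dangling_nodes(out_links):
    """Return pages with no known outlinks.

    A page counts as dangling if it appears only as a link target
    (never crawled) or if its recorded outlink list is empty.
    """
    all_pages = set(out_links) | {v for vs in out_links.values() for v in vs}
    with_outlinks = {u for u, vs in out_links.items() if vs}
    return all_pages - with_outlinks


# "b" was crawled but has no outlinks; "d" was discovered but never crawled.
crawl = {"a": ["b", "c"], "b": [], "c": ["a", "d"]}
frontier = dangling_nodes(crawl)
```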

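Handling Dangling Nodes
• Remove the dangling nodes, compute the ranks, then add the nodes back.
• Treat a dangling node as making a random jump.
• Solve a reduced eigensystem by power iteration, then obtain the ranks of the dangling nodes in a single step.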
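The remove-then-add-back strategy might look like the sketch below. It rests on several assumptions that are not from the slides: only one round of pruning is done, every remaining page must still have an outlink, `alpha` is the link-following probability, and the dangling scores come from one propagation step over the original links. The resulting scores are not renormalized.

```python
def rank_with_dangling_added_back(out_links, alpha=0.85, iters=100):
    """Rank the non-dangling core, then score dangling pages in one step."""
    all_nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    dangling = {u for u in all_nodes if not out_links.get(u)}
    core = sorted(all_nodes - dangling)

    # Step 1: drop dangling pages and every link that points to them.
    core_links = {u: [v for v in out_links[u] if v not in dangling] for u in core}
    # Assumption: one round of pruning is enough for this graph.
    if any(not vs for vs in core_links.values()):
        raise ValueError("pruning created new dangling pages; prune again first")

    # Step 2: ordinary power iteration on the core.
    n = len(core)
    rank = {u: 1.0 / n for u in core}
    for _ in range(iters):
        nxt = {u: (1.0 - alpha) / n for u in core}
        for u in core:
            share = alpha * rank[u] / len(core_links[u])
            for v in core_links[u]:
                nxt[v] += share
        rank = nxt

    # Step 3: add the dangling pages back with a single propagation step
    # over the original links (using the full out-degree of each source).
    scores = dict(rank)
    for d in dangling:
        scores[d] = 0.0
    for u in core:
        share = alpha * rank[u] / len(out_links[u])
        for v in out_links[u]:
            if v in dangling:
                scores[v] += share
    return scores


# "x" is a dangling page linked from "a" and "c".
web = {"a": ["b", "c", "x"], "b": ["a"], "c": ["a", "x"]}
scores = rank_with_dangling_added_back(web)
```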

Penalty Pages and Link Rot
• Penalty pages: dangling pages that return a 403 or 404 HTTP status code.
• Link rot: links that used to work but are now broken (penalty links, dangling links).

Effects of Dangling Nodes on Ranking
• Does teleportation reach dangling nodes?
  • Yes: node 3 has the highest rank score.
  • No: [ , , ]; node 3 scores lower than nodes 1 and 2.
• The number of dangling links:
  • 1 link: [ , , , ]
  • 4 links: [ , , , ]

Push-back algorithm
• If a page links to a penalty page, its rank is reduced by a fraction, and the excess rank is returned to the pages that pushed rank to it in the previous iteration.
• Page i retains all but a penalty fraction of its rank and distributes that fraction among its backlinks.
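One iteration of the push-back idea can be sketched as follows. This is an illustrative reading, not the paper's matrix formulation. The `penalty_frac` mapping, the rule of returning rank in proportion to what each backlink contributed, and `alpha` as the link-following probability are all assumptions made for the example.

```python
def pushback_step(out_links, penalty_frac, rank, alpha=0.85):
    """One PageRank step followed by a push-back of penalized rank.

    rank must contain every page, including pages that only appear as targets.
    penalty_frac maps a page to the fraction of rank it must give back
    (hypothetical parameter; pages not listed are not penalized).
    """
    nodes = list(rank)
    n = len(nodes)
    contrib = {v: {} for v in nodes}  # contrib[v][u]: rank pushed from u to v
    jump = 0.0
    for u in nodes:
        targets = out_links.get(u, [])
        if targets:
            share = alpha * rank[u] / len(targets)
            for v in targets:
                contrib[v][u] = contrib[v].get(u, 0.0) + share
            jump += (1.0 - alpha) * rank[u]
        else:
            jump += rank[u]
    nxt = {v: sum(contrib[v].values()) + jump / n for v in nodes}

    # Push-back: a penalized page returns part of its rank to its backlinks.
    result = dict(nxt)
    for i in nodes:
        received = sum(contrib[i].values())
        frac = penalty_frac.get(i, 0.0)
        if frac > 0.0 and received > 0.0:
            excess = frac * nxt[i]
            result[i] -= excess
            for u, c in contrib[i].items():
                result[u] += excess * c / received
    return result


# "b" links to the penalty page "dead" and is penalized by one half.
graph = {"a": ["b"], "b": ["a", "dead"], "c": ["b"]}
start = {u: 0.25 for u in ("a", "b", "c", "dead")}
plain = pushback_step(graph, {}, start)
pushed = pushback_step(graph, {"b": 0.5}, start)
```

Because the excess is only moved between pages, the total rank stays the same after the push-back.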

Self-Loop algorithm
• Augment each page with a self-loop link to itself.
• Page i follows this link with a probability that depends on i:
  • b_i is the number of outlinks from i to penalty pages.
  • g_i is the number of outlinks from i to non-penalty pages.
• The corresponding term of the ordinary model changes to account for the self-loop.
• Some variations exist.
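A possible reading of the self-loop idea is sketched below. The specific formulas are assumptions chosen for illustration, not recovered from the slides: page i follows its self-loop with probability `gamma * g_i / (g_i + b_i)`, follows outlinks with probability `alpha`, and jumps with the remaining probability, so a page with many links to penalty pages keeps less of its own rank.

```python
def self_loop_pagerank(out_links, penalty, alpha=0.7, gamma=0.2, iters=200):
    """PageRank variant in which each page keeps some rank via a self-loop.

    Assumptions (hypothetical): loop probability gamma * g_i / (g_i + b_i),
    link-following probability alpha, jump probability 1 - alpha - loop.
    Requires alpha + gamma <= 1.
    """
    nodes = sorted(set(out_links) | {v for vs in out_links.values() for v in vs})
    n = len(nodes)

    def loop_prob(i):
        targets = out_links.get(i, [])
        if not targets:
            return 0.0
        b = sum(1 for v in targets if v in penalty)  # links to penalty pages
        g = len(targets) - b                         # links to other pages
        return gamma * g / (g + b)

    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: 0.0 for u in nodes}
        jump = 0.0
        for u in nodes:
            targets = out_links.get(u, [])
            if not targets:
                jump += rank[u]  # dangling page: jump with probability 1
                continue
            loop = loop_prob(u)
            nxt[u] += loop * rank[u]
            for v in targets:
                nxt[v] += alpha * rank[u] / len(targets)
            jump += (1.0 - alpha - loop) * rank[u]
        for u in nodes:
            nxt[u] += jump / n
        rank = nxt
    return rank


# "x" and "y" receive identical inlinks, but "y" links to the penalty page.
graph = {"hub": ["x", "y"], "x": ["hub", "good"], "y": ["hub", "dead"], "good": ["hub"]}
ranks = self_loop_pagerank(graph, penalty={"dead"})
```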

Jump-weighting algorithm
• Instead of redistributing rank evenly, bias the redistribution so that penalized pages receive less rank.
• A straightforward method: weight the link from the virtual node
  • to an unpenalized node in C (the strongly connected node set) by a constant weight;
  • to a penalized node i by that weight times g_i/(g_i + b_i).
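The weighting rule can be turned into a teleportation distribution as in the sketch below. The normalization so that the weights sum to 1, and the restriction to pages that have outlinks, are assumptions for this example; the function name is hypothetical.

```python
def jump_weights(out_links, penalty):
    """Teleportation distribution that gives penalized pages less weight.

    A page i with g_i good outlinks and b_i outlinks to penalty pages gets a
    raw weight of g_i / (g_i + b_i), which equals 1 for unpenalized pages.
    Weights are then scaled to sum to 1 (assumed normalization).
    """
    raw = {}
    for i, targets in out_links.items():
        if not targets:
            continue  # only pages with outlinks receive jumps in this sketch
        b = sum(1 for v in targets if v in penalty)
        g = len(targets) - b
        raw[i] = g / (g + b)
    total = sum(raw.values())
    if total == 0.0:
        raise ValueError("every page links only to penalty pages")
    return {i: w / total for i, w in raw.items()}


# "b" has one good link and one link to the penalty page "dead".
graph = {"a": ["b"], "b": ["a", "dead"], "c": ["a"]}
weights = jump_weights(graph, penalty={"dead"})
```

These weights would replace the uniform `1/n` jump share in an ordinary PageRank iteration.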

BHITS algorithm
• Random walk in both the forward and backward directions.
• Forward step: the same as ordinary PageRank.
• Backward step:
  • Non-dangling nodes: self-loop.
  • Dangling nodes:
    • Non-penalty nodes: forward their score to the virtual node.
    • Penalty nodes: divide the score by the number of inlinks and propagate it equally among the backward links.
• A penalty page traverses to a random seed node.
• Matrix representation.

HostRank algorithm
• Hierarchical structure of the web:
  • 62.4% of links are internal to a site.
  • 82% of outlinks point to the top level of sites.
• Do not jump uniformly; jump to portals or top-level pages instead.
• Treat all pages on a site as a single body and assign them all a rank based on the collective value of the information on that site.
• Each site is represented by one node in the graph, so the graph becomes smaller and the computation cheaper.
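Collapsing a page-level link list into a host-level graph could be sketched as follows. Dropping links that stay within one host and keeping a link count per host pair are assumptions for this example, and the URLs are made up.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_graph(links):
    """Collapse (source URL, target URL) pairs into host-to-host edges.

    Returns a Counter mapping (source host, target host) to the number of
    page-level links between them. Links within a single host are dropped
    (an assumption of this sketch).
    """
    edges = Counter()
    for src, dst in links:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        if src_host and dst_host and src_host != dst_host:
            edges[(src_host, dst_host)] += 1
    return edges


page_links = [
    ("http://a.example/x.html", "http://a.example/y.html"),  # internal link
    ("http://a.example/x.html", "http://b.example/"),
    ("http://a.example/z.html", "http://b.example/p"),
    ("http://b.example/", "http://a.example/"),
]
edges = host_graph(page_links)
```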

DirRank algorithm
• HostRank is too coarse a level of granularity and suffers from a heavy-tailed distribution.
• DirRank graph:
  • Nodes: groups of URLs that share a prefix up to the last "/" or "?" (a virtual directory).
  • Edges: an edge exists if a URL in the source virtual directory links to a URL in the destination virtual directory.
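The virtual-directory grouping can be read literally as a prefix cut at the last "/" or "?". The sketch below follows that reading; keeping edges within a single directory and the example URLs are assumptions, and URLs without a path are not treated specially.

```python
def virtual_directory(url):
    """Return the prefix of url up to and including its last "/" or "?"."""
    cut = max(url.rfind("/"), url.rfind("?"))
    return url[:cut + 1] if cut >= 0 else url


def dir_graph(links):
    """Build the set of directory-level edges from page-level links."""
    return {(virtual_directory(src), virtual_directory(dst)) for src, dst in links}


page_links = [
    ("http://a.example/docs/intro.html", "http://a.example/docs/api.html"),
    ("http://a.example/docs/intro.html", "http://a.example/cgi/view?id=3"),
    ("http://a.example/docs/api.html", "http://a.example/cgi/view?id=7"),
]
dir_edges = dir_graph(page_links)
```

Both `view?id=3` and `view?id=7` fall into the same virtual directory, so the two page-level links collapse into one edge.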

Experimental Results
Setup:
• Crawl performed at IBM Almaden.
• More than 1 billion pages, 37 billion links, 4.75 billion URLs.
Results:
• Reduced computation:
  • DirRank: 114 million nodes, 15 billion edges
  • HostRank: 19.7 million hosts (nodes), 1.1 billion edges
• Enhanced resistance to link manipulation:
  • 11/20 in 100 million pages vs. 14/100 hostnames
  • Virtual node probability: 0.82 vs. 0.17

Conclusions
• PageRank with uniform teleportation is easily subject to link manipulation.
• The HostRank and DirRank algorithms are both cheaper to compute and less subject to link manipulation.
• The four proposed techniques for penalty pages can reduce bias and improve ranking performance.
• Future work: place the problem of web page ranking on a firmer scientific foundation, beyond trade and economic considerations.