TrustRank Algorithm Srđan Luković 2010/3482

Presentation transcript:

Introduction
- The web is huge and is likely to keep growing
- Search engines dominate the web business, so a high position in search results is valuable
- Some people try hard to trick search engines into ranking their pages highly
- That means more bad (spam) pages: pages created to fool search engines, a practice known as spamdexing
- The search engine business depends on successfully filtering out spam pages

Some Spamming Techniques
- Keyword stuffing hidden by the color scheme, e.g. white text on a white background
- Link farms: creating a number of bogus web pages that all link to one page in order to give it a better rank
- Honey pots: pages that provide some valuable content but contain links to spam pages
- Registering many keyword-rich domains
- Scraper sites: scraping content from search engines and other websites
- Article spinning: rewriting existing articles to escape the duplicate-content penalty
- Cloaking: serving a different page to the crawler than to human visitors

How do we filter out spam?
- How do we separate the “good” pages from the “bad” ones?
- It turns out that this is hard to do automatically
- The most reliable way is to use human experts, but how do we do that for billions of pages?
- Is there a way to draw conclusions from a small set of pages reviewed by experts?

Problem definition
- How can we semi-automatically estimate which pages are good and which are bad (spam), given a limited number of experts?
- Can we reliably say which pages are probably good and/or probably bad based on a small “seed” of pages reviewed by experts?
- How can we do it effectively and efficiently?

Basic Notions
- Web graph: pages are the nodes and hyperlinks are the directed edges
- Indegree ι(p): the number of inlinks of a page p
- Outdegree ω(p): the number of outlinks of a page p
- Transition matrix T
- Inverse transition matrix U
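For reference, the two matrices are usually defined as below in the TrustRank literature. This is a sketch under stated assumptions: E stands for the set of directed links (a symbol not used in the slides), ω(q) is the outdegree of page q, and ι(q) is its indegree.

```latex
T(p, q) =
\begin{cases}
  1/\omega(q) & \text{if } (q, p) \in E \\
  0           & \text{otherwise}
\end{cases}
\qquad
U(p, q) =
\begin{cases}
  1/\iota(q) & \text{if } (p, q) \in E \\
  0          & \text{otherwise}
\end{cases}
```

Read column by column, T spreads the score of page q equally over the pages q links to, while U spreads it equally over the pages that link to q.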

Finding T and U matrices
[Figure: example web graph with pages 1 to 4 and its T and U matrices]
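As an illustration of how T and U can be derived from a link list, here is a minimal Python sketch. The four-page graph in `EXAMPLE_EDGES` and the function name `transition_matrices` are assumptions made for this example and are not taken from the slide's figure. Exact fractions are used so the entries are easy to read.

```python
from fractions import Fraction


def transition_matrices(n, edges):
    """Build (T, U) for pages 1..n from a list of distinct (source, target) links."""
    out_deg = [0] * (n + 1)
    in_deg = [0] * (n + 1)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1

    T = [[Fraction(0)] * n for _ in range(n)]
    U = [[Fraction(0)] * n for _ in range(n)]
    for src, dst in edges:
        # T[p][q] = 1 / outdegree(q) when q links to p
        T[dst - 1][src - 1] = Fraction(1, out_deg[src])
        # U[p][q] = 1 / indegree(q) when p links to q
        U[src - 1][dst - 1] = Fraction(1, in_deg[dst])
    return T, U


# Hypothetical graph: 1 -> 2, 2 -> 3, 3 -> 2, 2 -> 4
EXAMPLE_EDGES = [(1, 2), (2, 3), (3, 2), (2, 4)]
T, U = transition_matrices(4, EXAMPLE_EDGES)

for name, matrix in (("T", T), ("U", U)):
    print(name)
    for row in matrix:
        print("  ", [str(x) for x in row])
```

Page 2 has two outlinks in this graph, so the column of T for page 2 holds 1/2 in the rows of pages 3 and 4.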

PageRank
- Assigns a global importance score to each page on the web by analyzing link information
- Rationale: the number of inlinks to a page determines its importance
- Each link pointing to a page is a vote
- The score propagates via links to other pages
- Problems:
  - It doesn't incorporate any knowledge about the quality of a page
  - It doesn't discern good pages from bad ones

PageRank
- TrustRank relies on PageRank
- Scores can be computed iteratively, usually with a fixed number M of iterations
- Biased PageRank assigns a non-zero static score only to a set of special pages; that score is distributed to other pages during the computation
- There are very efficient ways to calculate PageRank
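The iterative computation can be sketched as follows. This is a minimal illustration that assumes the common update r = α·T·r + (1 − α)·d, where d is the static score vector and α a decay factor; the names `biased_pagerank` and `alpha`, the default of 0.85, and the three-page cycle are assumptions of this example.

```python
def biased_pagerank(T, d, alpha=0.85, iterations=20):
    """Run a fixed number of iterations of r = alpha * T * r + (1 - alpha) * d."""
    n = len(d)
    r = list(d)  # start from the static score distribution
    for _ in range(iterations):
        r = [
            alpha * sum(T[p][q] * r[q] for q in range(n)) + (1 - alpha) * d[p]
            for p in range(n)
        ]
    return r


# Hypothetical cycle 1 -> 2 -> 3 -> 1, written directly as a transition matrix.
T_CYCLE = [
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
]

uniform_scores = biased_pagerank(T_CYCLE, [1 / 3, 1 / 3, 1 / 3])
biased_scores = biased_pagerank(T_CYCLE, [1.0, 0.0, 0.0], iterations=100)
print(uniform_scores)
print(biased_scores)
```

With a uniform d every page of the cycle keeps the same score. When all static score sits on page 1, the score fades along the cycle, so page 1 ranks above page 2, which ranks above page 3.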

Oracle and Trust Function
- Human checking of a web page is represented by a binary oracle function O(p), which tells whether page p is good or bad
- Oracle invocations are expensive, so one should strive to minimize them
- Important empirical observation for trust: approximate isolation of the good set
  - Good pages rarely link to bad ones
  - The converse does not hold: bad pages often link to good ones
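One common way to write the oracle, assuming a 0/1 encoding in which 1 stands for a good page, is:

```latex
O(p) =
\begin{cases}
  0 & \text{if } p \text{ is bad} \\
  1 & \text{if } p \text{ is good}
\end{cases}
```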

- To evaluate pages without calling O, it is necessary to estimate the probability that a page p is good
- The trust function T yields a value between 0 (bad) and 1 (good)
- Ideally, T(p) would equal the probability that p is good; this is hardly ever true in practice
- A relaxed constraint is pairwise orderedness: a page that is more likely to be good receives a higher trust score, so that search results can be displayed in that order
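Written out, the ideal property and its pairwise relaxation could look like this. It is a sketch assuming the 0/1 oracle O, with Pr[O(p) = 1] denoting the probability that page p is good.

```latex
% Ideal trust property
T(p) = \Pr[O(p) = 1]

% Ordered trust property, for every pair of pages p and q
T(p) < T(q) \iff \Pr[O(p) = 1] < \Pr[O(q) = 1]
T(p) = T(q) \iff \Pr[O(p) = 1] = \Pr[O(q) = 1]
```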

- Another way of relaxing the requirements on T is to introduce a threshold value δ
- If a page receives a score above δ, we know it is good; otherwise, we cannot say anything
- This does not necessarily provide an ordering based on the likelihood of being good
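In the same notation, the threshold requirement can be sketched as a one-directional implication, matching the statement that nothing can be concluded for scores at or below δ:

```latex
T(p) > \delta \implies O(p) = 1
```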

Seed selection
- The seed set should be reasonably small to limit oracle invocations, which are expensive because they require human effort
- Chosen pages should be useful in identifying additional good pages
- Random selection is the simplest approach, but it may hinder TrustRank's effectiveness

- Trust flows out of the good seed pages, so give preference to pages from which many other pages can be reached
- Select pages based on their outlinks
- The scheme is closely related to PageRank; the difference is that importance depends on outlinks, not inlinks
- Inverse PageRank: compute PageRank on the inverse transition matrix U
- This reuses existing algorithms and keeps their good performance
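A minimal sketch of seed selection with inverse PageRank follows. It assumes the usual PageRank iteration with a uniform static score, applied to U instead of T; the function names, the decay factor of 0.85, and the star-shaped example graph are assumptions of this illustration.

```python
def inverse_pagerank(U, alpha=0.85, iterations=20):
    """PageRank iteration on the inverse transition matrix U with uniform static scores."""
    n = len(U)
    uniform = 1.0 / n
    s = [uniform] * n
    for _ in range(iterations):
        s = [
            alpha * sum(U[p][q] * s[q] for q in range(n)) + (1 - alpha) * uniform
            for p in range(n)
        ]
    return s


def seed_candidates(U, count, alpha=0.85, iterations=20):
    """Return the `count` page numbers (1-based) with the highest inverse PageRank."""
    s = inverse_pagerank(U, alpha, iterations)
    order = sorted(range(1, len(U) + 1), key=lambda page: -s[page - 1])
    return order[:count]


# Hypothetical star graph: page 1 links to pages 2, 3 and 4, which link nowhere.
# Each target has indegree 1, so U[0][q] = 1 for q = 1, 2, 3 (0-based).
U_STAR = [
    [0, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

print(seed_candidates(U_STAR, 2))
```

Page 1 reaches every other page, so it comes out as the first seed candidate, even though it has no inlinks and would score poorly under ordinary PageRank.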

Trust propagation
- Expecting that good pages point to other good pages, all pages reachable from a good seed page in M or fewer steps are considered good
[Figure: example web graph; legend: good page, bad page]

[Figure: trust propagation on the example graph, with seed set S = {1, 3, 6} and maximum path length M = 1..3]
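The M-step rule can be sketched as a breadth-first search that starts from the good seeds and stops after M links. The graph below and the name `m_step_trust` are assumptions of this example; only seeds that the oracle judged good should be passed in.

```python
from collections import deque


def m_step_trust(outlinks, good_seeds, max_steps):
    """Return the pages reachable from any good seed in at most max_steps links."""
    trusted = set(good_seeds)
    frontier = deque((seed, 0) for seed in good_seeds)
    while frontier:
        page, distance = frontier.popleft()
        if distance == max_steps:
            continue  # do not follow links beyond the step limit
        for target in outlinks.get(page, ()):
            if target not in trusted:
                trusted.add(target)
                frontier.append((target, distance + 1))
    return trusted


# Hypothetical chain 1 -> 2 -> 3 -> 4 with page 1 as the only good seed.
CHAIN = {1: [2], 2: [3], 3: [4]}

for steps in (1, 2, 3):
    print(steps, sorted(m_step_trust(CHAIN, {1}, steps)))
```

Raising M marks more pages as good, but it also raises the chance of reaching a bad page, which motivates attenuating trust with distance.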

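Trust attenuation
- We cannot be absolutely sure that pages reachable from good seeds are indeed good
- The further away a page is from a good seed, the less certain we are that it is good
- Trust dampening: trust is multiplied by a dampening factor β at every step
- Trust splitting: a page's trust is divided equally among its outlinks
- The two can be combined
[Figure: trust dampening on pages 1, 2 and 3; trust β is passed one link away from the seed and β² two links away]
[Figure: trust splitting with t(1) = 1 and t(2) = 1; page 3 receives 1/2 and 1/3, so t(3) = 5/6; the figure also carries the label 5/12]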
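A single propagation step that combines both ideas can be sketched as below. The function name, the page numbers 4 to 6, and the exact link structure are assumptions of this example; only the values t(1) = 1, t(2) = 1 and the shares 1/2 and 1/3 come from the slide.

```python
from fractions import Fraction


def propagate_once(outlinks, trust, beta=Fraction(1)):
    """One step: each page splits beta times its trust equally among its outlinks."""
    received = {}
    for page, score in trust.items():
        targets = outlinks.get(page, [])
        for target in targets:
            share = beta * score / len(targets)
            received[target] = received.get(target, Fraction(0)) + share
    return received


# Page 1 has two outlinks and page 2 has three; both link to page 3.
OUTLINKS = {1: [3, 4], 2: [3, 5, 6]}
SEED_TRUST = {1: Fraction(1), 2: Fraction(1)}

splitting_only = propagate_once(OUTLINKS, SEED_TRUST)
with_dampening = propagate_once(OUTLINKS, SEED_TRUST, beta=Fraction(1, 2))
print(splitting_only[3], with_dampening[3])
```

With splitting alone, page 3 collects 1/2 + 1/3 = 5/6. Adding a dampening factor β scales every share by β, so the two mechanisms compose by simple multiplication.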

TrustRank in Action
1. Select the seed set using inverse PageRank; the resulting page order is [2, 4, 5, 1, 3, 6, 7]
2. Invoke the oracle function on the first L (= 3) pages of that order
3. Populate the static score distribution vector: d = [0, 1, 0, 1, 0, 0, 0] (pages 2 and 4 are judged good, page 5 bad)
4. Normalize the distribution vector: d = [0, 1/2, 0, 1/2, 0, 0, 0]
5. Calculate TrustRank scores using biased PageRank with trust dampening and trust splitting
Results: [0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05]
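The five steps can be put together in a short Python sketch. Everything specific here is an assumption made for illustration: the seven-page graph in `EDGES`, the set `BAD_PAGES` that stands in for the human oracle, the decay factor 0.85, and the 20 iterations. The sketch is not meant to reproduce the numbers on the slide.

```python
def trustrank(n, edges, oracle, limit, alpha=0.85, iterations=20):
    """Return (seed order, normalized static vector d, trust scores) for pages 1..n."""
    in_deg = [0] * (n + 1)
    out_deg = [0] * (n + 1)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1

    # Step 1: inverse PageRank, where scores flow backwards along the links.
    uniform = 1.0 / n
    s = [uniform] * (n + 1)  # index 0 is unused
    for _ in range(iterations):
        nxt = [(1 - alpha) * uniform] * (n + 1)
        for src, dst in edges:
            nxt[src] += alpha * s[dst] / in_deg[dst]
        s = nxt
    order = sorted(range(1, n + 1), key=lambda page: -s[page])

    # Step 2: ask the oracle about the first `limit` pages only.
    good_seeds = [page for page in order[:limit] if oracle(page)]
    if not good_seeds:
        raise ValueError("no good seed among the inspected pages")

    # Steps 3 and 4: static score vector, normalized to sum to 1.
    d = [0.0] * (n + 1)
    for page in good_seeds:
        d[page] = 1.0 / len(good_seeds)

    # Step 5: biased PageRank; alpha dampens trust and 1/outdegree splits it.
    t = list(d)
    for _ in range(iterations):
        nxt = [(1 - alpha) * x for x in d]
        for src, dst in edges:
            nxt[dst] += alpha * t[src] / out_deg[src]
        t = nxt
    return order, d[1:], t[1:]


# Hypothetical graph: pages 1 to 4 are good, pages 5 to 7 are spam.
EDGES = [(1, 2), (2, 3), (2, 4), (3, 2), (4, 5), (5, 6), (5, 7), (6, 5), (7, 6)]
BAD_PAGES = {5, 6, 7}

order, d, scores = trustrank(7, EDGES, oracle=lambda page: page not in BAD_PAGES, limit=3)
print("seed order:", order)
print("static vector:", d)
print("trust scores:", [round(x, 3) for x in scores])
```

Page 1 has no inlinks in this graph, so its trust can only come from its own static score. The spam pages still pick up some trust through the link from good page 4 to page 5, which mirrors the kind of exception discussed in the conclusions.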

Conclusions
- These results only give us an ordering of the pages, which is good enough for search results
- We can use a threshold value to select only pages that are good
- Page 1 is not referenced by any page, so it receives no score
- Spam page 5 is highly rated because of an inlink from a good page
- On a real-world web graph these exceptions are rare
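Both uses of the scores, ordering and thresholding, can be shown on the result vector from the example. The threshold δ = 0.14 is a hypothetical value chosen for this sketch, and the function names are likewise assumptions.

```python
# TrustRank scores for pages 1 to 7, as given in the example results.
RESULTS = [0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05]


def pages_by_trust(scores):
    """Page numbers (1-based) ordered from highest to lowest trust score."""
    return sorted(range(1, len(scores) + 1), key=lambda page: -scores[page - 1])


def pages_above_threshold(scores, delta):
    """Page numbers (1-based) whose trust score is strictly above delta."""
    return [page for page, score in enumerate(scores, start=1) if score > delta]


print(pages_by_trust(RESULTS))
print(pages_above_threshold(RESULTS, delta=0.14))
```

With this threshold only pages 2 and 4 are kept, so spam page 5 with its score of 0.13 is filtered out, at the cost of also dropping the remaining pages with lower scores.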

Further improvements
- During seed selection it is not necessary to order all pages using inverse PageRank
- If we already have PageRank scores, we can choose seeds from a subset of highly ranked pages
- Users will be more interested in those pages anyway
- We may determine TrustRank for a smaller number of pages, but those will be the more important ones

References
[1] Z. Gyöngyi, H. Garcia-Molina and J. Pedersen. Combating Web Spam with TrustRank. Tech. rep., Stanford University.
[2] Z. Gyöngyi and H. Garcia-Molina. Seed selection in TrustRank. Tech. rep., Stanford University.