The Efficacy of Collusions in Web Ranking and the Countermeasurements


The Efficacy of Collusions in Web Ranking and the Countermeasurements Hui Zhang University of Southern California

Outline
Problem statement.
PageRank algorithm: a brief introduction.
Study of PageRank’s robustness to collusion.
Adaptive-resetting: make PageRank robust to collusion.
Conclusions.
10/14/2018 USC CS599 P2Peco

Search Engine Optimization (SEO)
Not different from other research works on P2P rating, our research goal

Web spam [Gyöngyi et al. 2004]
Web spamming refers to actions intended to mislead search engines and give some pages a higher ranking than they deserve.
A spammer manipulates the two factors that determine the rank score of a page for a query:
Relevance: the textual similarity between the query and the page.
Importance: the global popularity of the page, which is query-independent.

Collusion in Web ranking
A manipulation of the hyperlink structure by a group of users with the intention of improving the rating of one or more users in the group.

PageRank [Brin1998]
An eigenvector-based rating scheme for ranking hypertext documents on the WWW.
An iterative algorithm that calculates the importance of a web page from the importance of its parent pages.
Can be applied to systems other than the WWW.

PageRank: random walk model
ε: resetting probability.
The walker: "With prob. (1 − ε), I will continue the walk to a random successor node."
The walker: "With prob. ε, I will restart the walk at a random node."
Diagram: nodes X, Y and Z joined by referential links, with edge labels 1/2 and 1/3.
As time goes on, the expected fraction of steps the walker spends at each node v converges to the PageRank weight PR(v).
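The random-walk description can be turned into a short power iteration. The following is a minimal sketch, not the presenter's code: the function name `pagerank`, the three-node toy graph, and the assumption that every node has at least one out-link are all illustrative choices, and `eps` stands for the resetting probability.

```python
def pagerank(out_links, eps=0.15, iters=200):
    """Stationary weights of the random walk with reset probability eps.

    out_links[u] lists the successors of node u (nodes are 0..n-1).
    Assumption: every node has at least one out-link (no dangling nodes).
    """
    n = len(out_links)
    pr = [1.0 / n] * n
    for _ in range(iters):
        # With probability eps the walker restarts at a uniformly random node.
        nxt = [eps / n] * n
        for u, succs in enumerate(out_links):
            # With probability 1 - eps it follows a random out-link of u.
            share = (1.0 - eps) * pr[u] / len(succs)
            for v in succs:
                nxt[v] += share
        pr = nxt
    return pr


# Hypothetical toy graph: X=0 links to Y=1 and Z=2, Y links to Z, Z links to X.
toy = [[1, 2], [2], [0]]
weights = pagerank(toy)
print([round(w, 4) for w in weights])
```

In this toy graph Z receives everything Y receives plus Y's own weight, so its PR weight ends up larger than Y's.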

PageRank: is it collusion-proof?
Can a node easily boost its rank by manipulating its outgoing links together with those of other nodes?
Speech bubble: "I’m not colluding!"

Amp(G): a metric on group collusion
Diagram: a node group G containing a subgroup G’ = {i, j}, with nodes x and y outside G’.
ε: resetting probability.
Real group weight: W_G(G’) = PR(i) + PR(j).
"Actual" group weight: W_in(G’) = […], an expression in (1 − ε), PR(x), PR(y) and N.
In the system of node group G, for a subgroup G’, the amplification factor Amp(G’) = […].

Answer for (1+1 = ?) in PageRank
In the original PageRank system, […], where ε is the resetting probability.

Two experimental topologies
W, a Web link topology: contains the link structure of upwards of 80 million URLs. Source: the Stanford WebBase.
B, a weblog blogrolling topology: contains the blogrolling structure of upwards of 72,000 blogs. Source: www.blogstreet.com, the XML-RPC weblog service.

Experiment 1: Collusion200
Model a small number of web pages colluding simultaneously.
Methodology:
200 nodes forming 100 colluding groups.
Each colluding group is a two-node ring whose nodes have adjacent ranks.
Node pairs were chosen arbitrarily from those originally ranked around 1000th, 2000th, …, 100000th (100th, 200th, …, 10000th for B due to the smaller graph size).
ε = 0.15.
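The effect of a two-node ring can be illustrated on a tiny graph. This is a sketch under stated assumptions, not the experiment itself: the 50-node directed cycle is a hypothetical stand-in for W or B, the helper `collude_pair` is an invented name, and it assumes each colluder drops all its original out-links and points only at its partner.

```python
def pagerank(out_links, eps=0.15, iters=500):
    """Power iteration for PageRank with reset probability eps."""
    n = len(out_links)
    pr = [1.0 / n] * n
    for _ in range(iters):
        nxt = [eps / n] * n
        for u, succs in enumerate(out_links):
            share = (1.0 - eps) * pr[u] / len(succs)
            for v in succs:
                nxt[v] += share
        pr = nxt
    return pr


def collude_pair(out_links, i, j):
    """Copy of the graph in which i and j link only to each other (assumed rewiring)."""
    rewired = [list(succs) for succs in out_links]
    rewired[i] = [j]
    rewired[j] = [i]
    return rewired


n = 50
cycle = [[(u + 1) % n] for u in range(n)]    # hypothetical topology: 0 -> 1 -> ... -> 49 -> 0
before = pagerank(cycle)                      # uniform weights, 1/n each
after = pagerank(collude_pair(cycle, 0, 1))   # nodes 0 and 1 form a two-node ring
gain = (after[0] + after[1]) / (before[0] + before[1])
print(round(gain, 2))
```

On this toy cycle the pair's combined weight grows several-fold, because node 1 no longer passes any weight on to node 2.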

Experiment result of Collusion200 (I)
Figure 1: W - Amplification factors of the 100 colluding groups in Collusion200.

Experiment result of Collusion200 (III)
Figure 2: W – new PR rank after Collusion200. Annotated points: old rank 100009th, new rank 5038th; old rank 10001st, new rank 450th; old rank 1005th, new rank 67th.

There is a long flat portion…
Figure 3: The PR weight distribution of 4 topologies.

Next step: how to detect collusions?
Identifying colluding groups is unlikely to be computationally tractable. Related hard problems:
The densest k-subgraph problem [Feige et al. 1997].
The classical CLIQUE problem.
The problem of finding large hidden cliques in random graphs [Juels 1998].

Hardness on Amp
Theorem on hardness: computing max_{G’ ⊆ G} Amp(G’) is an NP-hard problem.

How about using finer statistics of the random walk?
The revisit intervals of the random walk on a colluding node are likely to have a large variance compared with their expectation.
Figure E: A counterexample: a star + dangling circle topology (nodes labeled 1, 2, …, N−2, N−1, N, N+1).

An observation on collusion behaviors
To increase their PR weight, i.e., their stationary weight in the random walk, the colluding nodes will stall the random walk (diagram: subgroup G’ inside G).
When the resetting probability ε increases, the colluding nodes must suffer a significant drop in PR weight.
Therefore, we expect the PR weight of colluding nodes to be highly correlated with 1/ε (the average walk length), while that of non-colluding nodes is relatively insensitive to changes in ε.

An intuitive example
Diagram: nodes joined by referential links, with one colluding group marked; x is a colluding node and y is a non-colluding node.
N: the system size; K: the colluding group size; K << N.
A colluding node x: PR(x) = […], and co-co(PR(x), 1/ε) ≈ 1 (co-co: correlation coefficient).
A non-colluding node y: PR(y) = […], and co-co(PR(y), 1/ε) ≈ 0.

Adaptive-resetting scheme
Part I – collusion detection:
Given the topology, calculate the PR vector under different ε values: {ε} = {0.0375, 0.05, 0.075, 0.15, 0.3, 0.45, 0.6}, with ε_default = 0.15.
Calculate the correlation coefficient between the curve of each node x's PR weight and the curve of 1/ε. Label it co-co(x).
Part II – ε personalization:
Calculate each node x's out-link personalized ε = F(ε_default, co-co(x)).
Exponential function: F_Exp = […].
Linear function: F_Linear = ε_default + (0.5 − ε_default) * co-co(x).
The final PR weight vector is calculated with these personalized resetting values.
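The two parts of the scheme can be sketched end to end with the linear function. This is a sketch with several assumptions beyond the slide: co-co is taken to be the Pearson correlation, values of co-co outside [0, 1] are clamped so the personalized value stays between the default and 0.5, a node's personalized value is applied as the reset probability whenever the walker leaves that node, and the 50-node cycle with a colluding pair is a hypothetical test graph.

```python
EPS_VALUES = [0.0375, 0.05, 0.075, 0.15, 0.3, 0.45, 0.6]
DEFAULT_EPS = 0.15


def pagerank(out_links, eps, iters=1000):
    """PageRank where eps[u] is the reset probability used when leaving node u."""
    n = len(out_links)
    pr = [1.0 / n] * n
    for _ in range(iters):
        reset = sum(eps[u] * pr[u] for u in range(n)) / n
        nxt = [reset] * n
        for u, succs in enumerate(out_links):
            share = (1.0 - eps[u]) * pr[u] / len(succs)
            for v in succs:
                nxt[v] += share
        pr = nxt
    return pr


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def adaptive_resetting(out_links):
    n = len(out_links)
    inv_eps = [1.0 / e for e in EPS_VALUES]
    # Part I: PR vectors under each eps, then co-co(x) against the 1/eps curve.
    curves = [pagerank(out_links, [e] * n) for e in EPS_VALUES]
    coco = [pearson([c[x] for c in curves], inv_eps) for x in range(n)]
    # Part II: linear personalization (clamping co-co to [0, 1] is an assumption).
    personal = [DEFAULT_EPS + (0.5 - DEFAULT_EPS) * max(0.0, min(1.0, c)) for c in coco]
    return coco, personal, pagerank(out_links, personal)


n = 50
graph = [[(u + 1) % n] for u in range(n)]   # hypothetical directed cycle
graph[0], graph[1] = [1], [0]               # nodes 0 and 1 collude in a two-node ring
plain = pagerank(graph, [DEFAULT_EPS] * n)
coco, personal, final = adaptive_resetting(graph)
print(round(plain[0] + plain[1], 4), round(final[0] + final[1], 4))
```

In this toy graph the two colluders get a co-co close to 1 and a personalized resetting value close to 0.5, while the other nodes keep the default, so the pair's combined weight drops compared with plain PageRank.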

Experiment result of Collusion200 (IV)
Figure 5: W - Amplification factors of the 100 colluding groups in Collusion200.

Experiment result of Collusion200 (VI)
Figure 6: W – new PR rank after Collusion200.

Experiment 2: Collusion22
Model various colluding subgraphs.
Methodology: 3 colluding groups:
G1: 10-node ring
G2: 10-node star topology
G3: 2-node ring
(100th, 200th, …, 10000th for B due to the smaller graph size)

Experiment result of Collusion22 (I)
Figure 7: Amplification factors of the 3 colluding groups in Collusion22.

Experiment result of Collusion22 (II)
Figure 8: W – new PR weight after Collusion22.

New top-25 URL list in W (legend: Dropped out, Dropping, New)

Conclusions
Simple collusions lead to effective Web ranking improvement.
A simple scheme based on the PageRank algorithm effectively counteracts Web ranking collusions.

Backup slides

Reputation systems [Okita2003]
A means of describing social trust networks.
The basic concept is a democratic meritocracy.
A rating system is used to evaluate individual members, and the results are then collated to produce a consensus about the merit of any given member.
Examples: Livejournal, Friendster, eBay, Advogato.

PageRank algorithm [Brin1998]
Assume N pages. Assign all pages the initial value 1/N.
Let N_u be the out-degree of page u, Rank(v) the importance of page v, and B_v the set of pages pointing to v.
Basic algorithm: for every page v, Rank(v) = […].
Enhanced algorithm against rank sinks: for every page v, Rank(v) = […], with a damping factor.
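In terms of the definitions on this slide, the two updates are commonly written as below. The symbol d for the damping factor and the normalized (1 − d)/N term are assumptions made here; some presentations of PageRank use an unnormalized (1 − d) term instead.

```latex
% Basic update: each page splits its rank evenly over its out-links.
\mathrm{Rank}(v) = \sum_{u \in B_v} \frac{\mathrm{Rank}(u)}{N_u}

% Update with damping factor d (assumed symbol), which removes rank sinks.
\mathrm{Rank}(v) = \frac{1 - d}{N} + d \sum_{u \in B_v} \frac{\mathrm{Rank}(u)}{N_u}
```

Under this normalized form, 1 − d plays the role of the resetting probability in the random-walk model.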

Co-co distribution in real-world graphs
Figure 4: the co-co PDF distribution in W and B: the [0, 0.1] range actually corresponds to [-1, 0.1] range.

Experiment result of Collusion200 (II)
Figure A: W – new PR weight after Collusion200.

Experiment result of Collusion200 (VII)
Figure B: B – new PR rank after Collusion200

Experiment result of Collusion200 (X)
Figure C: B – new PR weight after Collusion200

Experiment result of Collusion200 (V)
Figure 6: W – new PR weight after Collusion200.

Correlation coefficient
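Assuming co-co denotes the usual Pearson correlation coefficient, it can be written as follows for one node, where p_k is the node's PR weight at resetting value ε_k and q_k = 1/ε_k. The symbols p_k and q_k are introduced here for illustration only.

```latex
\mathrm{co\text{-}co} =
  \frac{\sum_{k} (p_k - \bar{p})(q_k - \bar{q})}
       {\sqrt{\sum_{k} (p_k - \bar{p})^2}\;\sqrt{\sum_{k} (q_k - \bar{q})^2}},
\qquad q_k = \frac{1}{\varepsilon_k}
```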

Experiment result of Collusion22 (III)
Figure D: W – new PR rank after Collusion22.