Web Spam Detection with Anti-Trust Rank
Vijay Krishnan, Rashmi Raj
Computer Science Department, Stanford University


The World Wide Web
– Huge
– Distributed content creation, linking (no coordination)
– Structured databases, unstructured text, semi-structured data
– Content includes truth, lies, obsolete information, contradictions, …

PageRank
– Intuition: “a page is important if important pages link to it.”
– In high-falutin’ terms: importance = the principal eigenvector of the stochastic matrix of the Web.
– (A few fixups needed.)

PageRank
– Web graph encoded by matrix M
  – N × N matrix (N = number of web pages)
  – Mij = 1/|O(j)| iff there is a link from j to i
  – Mij = 0 otherwise
  – O(j) = set of pages that node j links to
– Define matrix A as follows
  – Aij = βMij + (1−β)/N, where 0 < β < 1
  – 1−β is the “tax” discussed in the prior lecture
– PageRank r is the principal eigenvector of A
  – Ar = r
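
The definition above can be sketched as a power iteration. This is a minimal illustration, not the authors' code: the function name `pagerank`, the value β = 0.85, the iteration count, and the three-page toy graph are all assumptions, and the sketch assumes every page has at least one out-link.

```python
def pagerank(out_links, beta=0.85, iterations=100):
    """Approximate r with A r = r, where A = beta*M + (1 - beta)/N.

    out_links maps each page j to the list O(j) of pages it links to.
    Assumption: every page has at least one out-link (no dead ends).
    """
    pages = sorted(out_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    for _ in range(iterations):
        # The "tax" (1 - beta) is shared equally among all N pages.
        nxt = {p: (1.0 - beta) / n for p in pages}
        for j in pages:
            # Page j passes beta * r[j] in equal parts to the pages in O(j).
            share = beta * r[j] / len(out_links[j])
            for i in out_links[j]:
                nxt[i] += share
        r = nxt
    return r


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Because r always sums to 1, adding (1−β)/N to every entry is the same as multiplying r by the (1−β)/N term of A, so the full N × N matrix never needs to be built.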

Many Random Walkers Model
– Imagine a large number M of independent, identical random walkers (M ≫ N)
– At any point in time, let M(p) be the number of random walkers at page p
– The PageRank of p is the fraction of random walkers that are expected to be at page p, i.e., E[M(p)]/M
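
The walker model can be illustrated with a small Monte Carlo simulation. This is a sketch under stated assumptions: each walker follows a random out-link with probability β and otherwise jumps to a uniformly random page (matching the matrix A from the previous slide), and the name `simulate_walkers`, the parameter values, and the toy graph are hypothetical.

```python
import random


def simulate_walkers(out_links, walkers=10000, steps=30, beta=0.85, seed=0):
    """Estimate E[M(p)]/M by simulating independent random walkers.

    Assumption: every page has at least one out-link.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pages = sorted(out_links)
    counts = {p: 0 for p in pages}
    for _ in range(walkers):
        page = rng.choice(pages)  # each walker starts at a random page
        for _ in range(steps):
            if rng.random() < beta:
                page = rng.choice(out_links[page])  # follow a random out-link
            else:
                page = rng.choice(pages)  # pay the "tax": jump anywhere
        counts[page] += 1
    # Fraction of walkers that ended up at each page.
    return {p: counts[p] / walkers for p in pages}


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
fractions = simulate_walkers(graph)
print(fractions)
```

With more walkers the fractions move closer to the eigenvector solution, at the cost of a longer run.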

Economic Considerations
– Search has become the default gateway to the web
– Very high premium to appear on the first page of search results
  – e.g., e-commerce sites
  – advertising-driven sites

What is Web Spam?
– Spamming = any deliberate action taken solely to boost a web page’s position in search engine results, incommensurate with the page’s real value
– Spam = web pages that are the result of spamming
– This is a very broad definition
  – SEO industry might disagree!
  – SEO = search engine optimization
– Approximately 10-15% of web pages are spam

Types of Spamming Techniques
– Term spamming
  – Manipulating the text of web pages in order to appear relevant to queries
– Link spamming
  – Creating link structures that boost PageRank or hubs and authorities scores

Link Spam
– Three kinds of web pages from a spammer’s point of view
  – Inaccessible pages
  – Accessible pages
    – e.g., blog comment pages, where the spammer can post links to his own pages
  – Own pages
    – Completely controlled by the spammer
    – May span multiple domain names

Link Spam Detection
– Open research area
– One approach: TrustRank

TrustRank
– Basic principle: approximate isolation
  – It is rare for a “good” page to point to a “bad” (spam) page
– Sample a set of “seed pages” from the web
– Set the trust of each trusted seed page to 1
– Propagate trust through links
  – Each page gets a trust value between 0 and 1
– Choose a threshold value and mark all pages below the trust threshold as spam
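
The steps above can be sketched as follows. The slide does not fix the exact propagation rule, so this sketch assumes a PageRank-style iteration whose "tax" returns only to the trusted seeds. It also spreads an initial total trust of 1 evenly over the seeds so that every score stays between 0 and 1. The name `trust_rank`, β = 0.85, the threshold 0.1, and the toy graph are illustrative assumptions.

```python
def trust_rank(out_links, trusted_seeds, beta=0.85, iterations=50):
    """Propagate trust forward along links from a set of trusted seeds.

    Assumptions: out_links has a key for every page, and trust is
    propagated as a seed-biased PageRank (one possible choice).
    """
    pages = sorted(out_links)
    seed_mass = {
        p: (1.0 / len(trusted_seeds) if p in trusted_seeds else 0.0)
        for p in pages
    }
    trust = dict(seed_mass)
    for _ in range(iterations):
        # The (1 - beta) share goes back to the trusted seeds only.
        nxt = {p: (1.0 - beta) * seed_mass[p] for p in pages}
        for j in pages:
            if not out_links[j]:
                continue  # trust held by a page with no out-links is dropped
            share = beta * trust[j] / len(out_links[j])
            for i in out_links[j]:
                nxt[i] += share
        trust = nxt
    return trust


# Hypothetical graph: good pages link to each other, spam links to good.
graph = {
    "good1": ["good2"],
    "good2": ["good1"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
trust = trust_rank(graph, trusted_seeds={"good1"})
threshold = 0.1  # illustrative value
flagged = sorted(p for p, t in trust.items() if t < threshold)
print(flagged)
```

In this toy graph no good page links to a spam page, so the spam pages receive no trust at all. On a real graph the isolation is only approximate, which is why a threshold is needed.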

Anti-Trust Approach
– Broadly based on the same “approximate isolation” principle
– This principle also implies that pages pointing to spam pages are very likely to be spam pages themselves
– Anti-Trust is propagated in the reverse direction along incoming links, starting from a seed set of spam pages
– A page can be classified as spam if its Anti-Trust Rank value exceeds a chosen threshold
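
A minimal sketch of the reverse propagation, assuming the same PageRank-style update as TrustRank but run on the graph with every link reversed and with the "tax" returned to the spam seeds. The names `reverse_graph` and `anti_trust_rank`, β = 0.85, the threshold 0.05, and the toy graph are assumptions for illustration only.

```python
def reverse_graph(out_links):
    """Return the graph with every link direction flipped."""
    in_links = {p: [] for p in out_links}
    for src, targets in out_links.items():
        for dst in targets:
            in_links[dst].append(src)
    return in_links


def anti_trust_rank(out_links, spam_seeds, beta=0.85, iterations=50):
    """Propagate anti-trust backwards along incoming links.

    Assumption: out_links has a key for every page in the graph.
    """
    reversed_links = reverse_graph(out_links)
    pages = sorted(reversed_links)
    seed_mass = {
        p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0) for p in pages
    }
    score = dict(seed_mass)
    for _ in range(iterations):
        nxt = {p: (1.0 - beta) * seed_mass[p] for p in pages}
        for j in pages:
            if not reversed_links[j]:
                continue  # a page nobody links to passes nothing further
            # Anti-trust of j is split among the pages that link to j.
            share = beta * score[j] / len(reversed_links[j])
            for i in reversed_links[j]:
                nxt[i] += share
        score = nxt
    return score


# Hypothetical graph: two link-farm pages boost one known spam page.
graph = {
    "good1": ["good2"],
    "good2": ["good1"],
    "farm1": ["spam"],
    "farm2": ["spam"],
    "spam": ["farm1"],
}
scores = anti_trust_rank(graph, spam_seeds={"spam"})
threshold = 0.05  # illustrative value
flagged = sorted(p for p, s in scores.items() if s > threshold)
print(flagged)
```

Here the farm pages are never labelled by hand, yet they pick up anti-trust because they point at the seed, while the good pages stay at zero.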

Seed Set Selection
– Seed spam set chosen from pages with high PageRank
– Nearly 100% of URLs containing certain terms such as {viagra, gambling, hardporn} as substrings are spam
  – Use these for evaluation
– Some seed pages were also chosen by an oracle (a human expert)
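
One way to read this slide in code is shown below. This is a hedged sketch, not the authors' procedure: it assumes seeds are picked by walking pages from highest to lowest PageRank and keeping those an oracle labels as spam, and that the URL-substring heuristic is applied case-insensitively. The helper names, the example URLs, and the PageRank values are all made up.

```python
SPAM_TERMS = ("viagra", "gambling", "hardporn")  # terms listed on the slide


def looks_like_spam(url, terms=SPAM_TERMS):
    """Heuristic label: does the URL contain a listed term as a substring?"""
    lowered = url.lower()  # assumption: matching ignores case
    return any(term in lowered for term in terms)


def pick_spam_seeds(page_rank, is_spam, seed_count):
    """Keep the first seed_count spam pages, scanning by descending PageRank.

    is_spam plays the role of the oracle; here it can be any callable.
    """
    seeds = []
    for url in sorted(page_rank, key=page_rank.get, reverse=True):
        if len(seeds) == seed_count:
            break
        if is_spam(url):
            seeds.append(url)
    return seeds


# Hypothetical PageRank scores for four made-up URLs.
page_rank = {
    "http://news.example.org/": 0.40,
    "http://cheap-viagra.example.com/": 0.30,
    "http://blog.example.net/": 0.20,
    "http://best-gambling.example.com/": 0.10,
}
seeds = pick_spam_seeds(page_rank, looks_like_spam, seed_count=1)
print(seeds)
```

Using the URL heuristic as the oracle here is only for the demo. On the slide the heuristic is used for evaluation, and a human expert chooses some of the seeds.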

Results
– Overall percentage of “spam” pages = 0.28%
– Average PageRank of “spam” pages / overall average PageRank = 2.6
– Percentage of “spam” pages in:
  – top 1000 Anti-Trust Rank pages = 25.3%
  – bottom 1000 TrustRank pages = 0.68%
– Ratio of the average PageRank of spam pages returned by Anti-Trust Rank (ATR) vs. TrustRank (TR) is roughly 6

References
– The PageRank citation ranking: Bringing order to the web. L. Page, S. Brin, R. Motwani and T. Winograd. Technical Report, Stanford University.
– Combating Web Spam with Trust Rank. Zoltan Gyongyi, Hector Garcia-Molina and Jan Pedersen. In VLDB.
– Topic-sensitive PageRank. Taher Haveliwala. In WWW.
– The WebGraph dataset. Online at: