CS345 Data Mining Web Spam Detection. Economic considerations  Search has become the default gateway to the web  Very high premium to appear on the.

Slides:



Advertisements
Similar presentations
TrustRank Algorithm Srđan Luković 2010/3482
Advertisements

What is WEB SPAM Many slides from a lecture by Marc Najork, Microsoft: “Detecting Spam Web Pages”
Analysis of Large Graphs: TrustRank and WebSpam
Link Analysis: PageRank and Similar Ideas. Recap: PageRank Rank nodes using link structure PageRank: – Link voting: P with importance x has n out-links,
CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.
How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-
Link Analysis David Kauchak cs160 Fall 2009 adapted from:
22 May 2006 Wu, Goel and Davison Models of Trust for the Web (MTW) WWW2006 Workshop L EHIGH U NIVERSITY.
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
CS246: Page Selection. Junghoo "John" Cho (UCLA Computer Science) 2 Page Selection Infinite # of pages on the Web – E.g., infinite pages from a calendar.
Focused Crawling in Depression Portal Search: A Feasibility Study Thanh Tin Tang (ANU) David Hawking (CSIRO) Nick Craswell (Microsoft) Ramesh Sankaranarayana(ANU)
Anatomy of a Large-Scale Hypertextual Web Search Engine (e.g. Google)
CS345 Data Mining Link Analysis 3: Hubs and Authorities Spam Detection Anand Rajaraman, Jeffrey D. Ullman.
Sigir’99 Inside Internet Search Engines: Search Jan Pedersen and William Chang.
Search Engine Optimization By Andy Smith | Art Institute of Dallas.
PageRank Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata October 27, 2014.
CS345 Data Mining Link Analysis 2: Topic-Specific Page Rank Hubs and Authorities Spam Detection Anand Rajaraman, Jeffrey D. Ullman.
Information Retrieval
WEB SPAM A By-Product Of The Search Engine Era Web Enhanced Information Management Aniruddha Dutta Department of Computer Science Columbia University.
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
Todd Friesen April, 2007 SEO Workshop Web 2.0 Expo San Francisco.
Increasing Website ROI through SEO and Analytics Dan Belhassen greatBIGnews.com Modern Earth Inc.
Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.
Web Spam Detection: link-based and content-based techniques Reporter : 鄭志欣 Advisor : Hsing-Kuo Pao 2010/11/8 1.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Designing for Search Engines MIS 314 MIS 314 Professor Sandvig Professor Sandvig.
1 Link Analysis and spam Slides adapted from – Information Retrieval and Web Search, Stanford University, Christopher Manning and Prabhakar Raghavan –
Adversarial Information Retrieval The Manipulation of Web Content.
Adversarial Information Retrieval on the Web or How I spammed Google and lost Dr. Frank McCown Search Engine Development – COMP 475 Mar. 24, 2009.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2015 Lecture 8: Information Retrieval II Aidan Hogan
Web Spam Yonatan Ariel SDBI 2005 Based on the work of Gyongyi, Zoltan; Berkhin, Pavel; Garcia-Molina, Hector; Pedersen, Jan Stanford University The Hebrew.
Search Engine Optimization ext 304 media-connection.com The process affecting the visibility of a website across various search engines to.
Web Spam Detection with Anti- Trust Rank Vijay Krishnan Rashmi Raj Computer Science Department Stanford University.
Link Analysis 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 7: Link Analysis Mining Massive Datasets.
Promotion & Cataloguing AGCJ 407 Web Authoring in Agricultural Communications.
Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.
Graph-based Algorithms in Large Scale Information Retrieval Fatemeh Kaveh-Yazdy Computer Engineering Department School of Electrical and Computer Engineering.
Web spamming Detecting Spam Web Pages through Content Analysis Alexandros Ntoulas et al, 2006, International World Wide Web Conference.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
Cloak and Dagger: Dynamics of Web Search Cloaking David Y. Wang, Stefan Savage, and Geoffrey M. Voelker University of California, San Diego 左昌國 Seminar.
SEO techniques & Mastering Google Adwords By Ganesh.S
How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-
Lecture 2: Extensions of PageRank
Search Engine Optimization: A Survey of Current Best Practices Author - Niko Solihin Resource -Grand Valley State University April, 2013 Professor - Soe-Tsyr.
Link Analysis in Web Mining Hubs and Authorities Spam Detection.
Basic Search Engine Optimization. What is SEO?  SEO is an abbreviation for search engine optimization.
What Is SEO? Search engine optimization (SEO) is the art and science of publishing and marketing information that ranks well for valuable keywords in.
SMX Madrid 2008 Uncovering the Algorithm A Peek Inside How Google Evaluates and Ranks Pages.
CS345 Data Mining Link Analysis 2: Topic-Specific Page Rank Hubs and Authorities Spam Detection Anand Rajaraman, Jeffrey D. Ullman.
Search Engine Optimization Information Systems 337 Prof. Harry Plantinga.
CC P ROCESAMIENTO M ASIVO DE D ATOS O TOÑO 2014 Aidan Hogan Lecture IX: 2014/05/05.
KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.
Link Analysis Algorithms Page Rank Slides from Stanford CS345, slightly modified.
What is WEB SPAM Many slides are from a lecture by Marc Najork: “Detecting Spam Web Pages”
Week 5  SEO  CSS Please Visit: to download all the PowerPoint Slides for.
TrustRank. 2 Observation – Good pages tend to link good pages. – Human is the best spam detector Algorithm – Select a small subset of pages and let a.
Adversarial Information System Tanay Tandon Web Enhanced Information Management April 5th, 2011.
Web Spam Taxonomy Zoltán Gyöngyi, Hector Garcia-Molina Stanford Digital Library Technologies Project, 2004 presented by Lorenzo Marcon 1/25.
Jeffrey D. Ullman Stanford University. 3  Mutually recursive definition:  A hub links to many authorities;  An authority is linked to by many hubs.
Link Analysis and spam Slides adapted from
Combatting Web Spam Dealing with Non-Main-Memory Web Graphs SimRank
CCT356: Online Advertising and Marketing
WEB SPAM.
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2017 Lecture 7: Information Retrieval II Aidan Hogan
SEARCH ENGINE OPTIMIZATION SEO. What is SEO? It is the process of optimizing structure, design and content of your website in order to increase traffic.
Aidan Hogan CC Procesamiento Masivo de Datos Otoño 2018 Lecture 7 Information Retrieval: Ranking Aidan Hogan
Web Spam
Junghoo “John” Cho UCLA
Link Analysis II Many slides are borrowed from Stanford Data Mining Class taught by Drs Anand Rajaraman, Jeffrey D. Ullman, and Jure Leskovec.
Information Retrieval and Web Design
Presentation transcript:

CS345 Data Mining Web Spam Detection

Economic considerations  Search has become the default gateway to the web  Very high premium to appear on the first page of search results e.g., e-commerce sites advertising-driven sites

What is web spam?  Spamming = any deliberate action solely in order to boost a web page’s position in search engine results, incommensurate with page’s real value  Spam = web pages that are the result of spamming  This is a very broad defintion SEO industry might disagree! SEO = search engine optimization  Approximately 10-15% of web pages are spam

Web Spam Taxonomy  We follow the treatment by Gyongyi and Garcia-Molina [2004]  Boosting techniques Techniques for achieving high relevance/importance for a web page  Hiding techniques Techniques to hide the use of boosting  From humans and web crawlers

Boosting techniques  Term spamming Manipulating the text of web pages in order to appear relevant to queries  Link spamming Creating link structures that boost page rank or hubs and authorities scores

Term Spamming  Repetition of one or a few specific terms e.g., free, cheap, viagra Goal is to subvert TF.IDF ranking schemes  Dumping of a large number of unrelated terms e.g., copy entire dictionaries  Weaving Copy legitimate pages and insert spam terms at random positions  Phrase Stitching Glue together sentences and phrases from different sources

Term spam targets  Body of web page  Title  URL  HTML meta tags  Anchor text

Link spam  Three kinds of web pages from a spammer’s point of view Inaccessible pages Accessible pages  e.g., web log comments pages  spammer can post links to his pages Own pages  Completely controlled by spammer  May span multiple domain names

Link Farms  Spammer’s goal Maximize the page rank of target page t  Technique Get as many links from accessible pages as possible to target page t Construct “link farm” to get page rank multiplier effect

Link Farms Inaccessible t AccessibleOwn 1 2 M One of the most common and effective organizations for a link farm

Analysis Suppose rank contributed by accessible pages = x Let page rank of target page = y Rank of each “farm” page = y/M + (1-)/N y = x + M[y/M + (1-)/N] + (1-)/N = x +  2 y + (1-)M/N + (1-)/N y = x/(1- 2 ) + cM/N where c = /(1+) Inaccessibl e t Accessible Own 1 2 M Very small; ignore

Analysis  y = x/(1- 2 ) + cM/N where c = /(1+)  For  = 0.85, 1/(1- 2 )= 3.6 Multiplier effect for “acquired” page rank By making M large, we can make y as large as we want Inaccessibl e t Accessible Own 1 2 M

Hiding techniques  Content hiding Use same color for text and page background  Cloaking Return different page to crawlers and browsers  Redirection Alternative to cloaking Redirects are followed by browsers but not crawlers

Detecting Spam  Term spamming Analyze text using statistical methods e.g., Naïve Bayes classifiers Similar to spam filtering Also useful: detecting approximate duplicate pages  Link spamming Open research area One approach: TrustRank

TrustRank idea  Basic principle: approximate isolation It is rare for a “good” page to point to a “bad” (spam) page  Sample a set of “seed pages” from the web  Have an oracle (human) identify the good pages and the spam pages in the seed set Expensive task, so must make seed set as small as possible

Trust Propagation  Call the subset of seed pages that are identified as “good” the “trusted pages”  Set trust of each trusted page to 1  Propagate trust through links Each page gets a trust value between 0 and 1 Use a threshold value and mark all pages below the trust threshold as spam

Example good bad

Rules for trust propagation  Trust attenuation The degree of trust conferred by a trusted page decreases with distance  Trust splitting The larger the number of outlinks from a page, the less scrutiny the page author gives each outlink Trust is “split” across outlinks

Simple model  Suppose trust of page p is t(p) Set of outlinks O(p)  For each q 2 O(p), p confers the trust t(p)/|O(p)| for 0<<1  Trust is additive Trust of p is the sum of the trust conferred on p by all its inlinked pages  Note similarity to Topic-Specific Page Rank Within a scaling factor, trust rank = biased page rank with trusted pages as teleport set

Picking the seed set  Two conflicting considerations Human has to inspect each seed page, so seed set must be as small as possible Must ensure every “good page” gets adequate trust rank, so need make all good pages reachable from seed set by short paths

Approaches to picking seed set  Suppose we want to pick a seed set of k pages  PageRank Pick the top k pages by page rank Assume high page rank pages are close to other highly ranked pages We care more about high page rank “good” pages

Inverse page rank  Pick the pages with the maximum number of outlinks  Can make it recursive Pick pages that link to pages with many outlinks  Formalize as “inverse page rank” Construct graph G’ by reversing each edge in web graph G Page Rank in G’ is inverse page rank in G  Pick top k pages by inverse page rank