Detecting Spam Web Pages through Content Analysis

Slides:



Advertisements
Similar presentations
A Comparison of Implicit and Explicit Links for Web Page Classification Dou Shen 1 Jian-Tao Sun 2 Qiang Yang 1 Zheng Chen 2 1 Department of Computer Science.
Advertisements

Temporal Query Log Profiling to Improve Web Search Ranking Alexander Kotov (UIUC) Pranam Kolari, Yi Chang (Yahoo!) Lei Duan (Microsoft)
CPSC 502, Lecture 15Slide 1 Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 15 Nov, 1, 2011 Slide credit: C. Conati, S.
TrustRank Algorithm Srđan Luković 2010/3482
Site Level Noise Removal for Search Engines André Luiz da Costa Carvalho Federal University of Amazonas, Brazil Paul-Alexandru Chirita L3S and University.
What is WEB SPAM Many slides from a lecture by Marc Najork, Microsoft: “Detecting Spam Web Pages”
Jean-Eudes Ranvier 17/05/2015Planet Data - Madrid Trustworthiness assessment (on web pages) Task 3.3.
22 May 2006 Wu, Goel and Davison Models of Trust for the Web (MTW) WWW2006 Workshop L EHIGH U NIVERSITY.
Explorations in Tag Suggestion and Query Expansion Jian Wang and Brian D. Davison Lehigh University, USA SSM 2008 (Workshop on Search in Social Media)
Search Engines and Information Retrieval
CS345 Data Mining Web Spam Detection. Economic considerations  Search has become the default gateway to the web  Very high premium to appear on the.
Presented by Li-Tal Mashiach Learning to Rank: A Machine Learning Approach to Static Ranking Algorithms for Large Data Sets Student Symposium.
Page-level Template Detection via Isotonic Smoothing Deepayan ChakrabartiYahoo! Research Ravi KumarYahoo! Research Kunal PuneraUniv. of Texas at Austin.
Automatic Discovery and Classification of search interface to the Hidden Web Dean Lee and Richard Sia Dec 2 nd 2003.
Detecting Spam Web Pages Marc Najork Microsoft Research, Silicon Valley.
Chapter 5: Information Retrieval and Web Search
Search Engine Optimization. What is SEO? Search engine optimization (SEO) is the process of improving the visibility of a website or a web page in search.
WEB SPAM A By-Product Of The Search Engine Era Web Enhanced Information Management Aniruddha Dutta Department of Computer Science Columbia University.
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
Web Spam Detection: link-based and content-based techniques Reporter : 鄭志欣 Advisor : Hsing-Kuo Pao 2010/11/8 1.
Search Engines and Information Retrieval Chapter 1.
Adversarial Information Retrieval on the Web or How I spammed Google and lost Dr. Frank McCown Search Engine Development – COMP 475 Mar. 24, 2009.
Automatically Identifying Localizable Queries Center for E-Business Technology Seoul National University Seoul, Korea Nam, Kwang-hyun Intelligent Database.
Reporter: Li, Fong Ruei National Taiwan University of Science and Technology 9/19/2015Slide 1 (of 32)
Web Spam Detection with Anti- Trust Rank Vijay Krishnan Rashmi Raj Computer Science Department Stanford University.
Know your Neighbors: Web Spam Detection Using the Web Topology Presented By, SOUMO GORAI Carlos Castillo(1), Debora Donato(1), Aristides Gionis(1), Vanessa.
Automatic Detection of Tags for Political Blogs Khairun-nisa Hassanali and Vasileios Hatzivassiloglou Human Language Technology Research Institute The.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Improving Web Search Ranking by Incorporating User Behavior Information Eugene Agichtein Eric Brill Susan Dumais Microsoft Research.
Improving Web Spam Classification using Rank-time Features September 25, 2008 TaeSeob,Yun KAIST DATABASE & MULTIMEDIA LAB.
Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Lehigh University, USA WWW 2006.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
Heuristics for Detecting Spam Web Pages Marc Najork Microsoft Research, Silicon Valley Joint work with Fetterly, Manasse, Ntoulas.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-
Amy Dai Machine learning techniques for detecting topics in research papers.
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Know your Neighbors: Web Spam Detection using the Web Topology By Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock and Fabrizio Silvestri.
Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock,
A Kosher Source of Ham Nathan Friess John Aycock Department of Computer Science University of Calgary Canada.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
21/11/20151Gianluca Demartini Ranking Clusters for Web Search Gianluca Demartini Paul–Alexandru Chirita Ingo Brunkhorst Wolfgang Nejdl L3S Info Lunch Hannover,
Algorithmic Detection of Semantic Similarity WWW 2005.
Social Tag Prediction Paul Heymann, Daniel Ramage, and Hector Garcia- Molina Stanford University SIGIR 2008.
Graph Algorithms: Classification William Cohen. Outline Last week: – PageRank – one algorithm on graphs edges and nodes in memory nodes in memory nothing.
Bayesian Filtering Team Glyph Debbie Bridygham Pravesvuth Uparanukraw Ronald Ko Rihui Luo Thuong Luu Team Glyph Debbie Bridygham Pravesvuth Uparanukraw.
Class Imbalance in Text Classification
26/01/20161Gianluca Demartini Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
What is WEB SPAM Many slides are from a lecture by Marc Najork: “Detecting Spam Web Pages”
Identifying Spam Web Pages Based on Content Similarity Sole Pera CS 653 – Term paper project.
Language Model for Machine Translation Jang, HaYoung.
Opinion spam and Analysis 소프트웨어공학 연구실 G 최효린 1 / 35.
Web Spam Taxonomy Zoltán Gyöngyi, Hector Garcia-Molina Stanford Digital Library Technologies Project, 2004 presented by Lorenzo Marcon 1/25.
Dec 14, 2014, Harvard University
Measuring Monolinguality
WEB SPAM.
Source: Procedia Computer Science(2015)70:
SEARCH ENGINE OPTIMIZATION SEO. What is SEO? It is the process of optimizing structure, design and content of your website in order to increase traffic.
Text Categorization Rong Jin.
iSRD Spam Review Detection with Imbalanced Data Distributions
Web Spam
Chapter 5: Information Retrieval and Web Search
Junghoo “John” Cho UCLA
Date: 2012/11/15 Author: Jin Young Kim, Kevyn Collins-Thompson,
A Classification-based Approach to Question Routing in Community Question Answering Tom Chao Zhou 22, Feb, 2010 Department of Computer.
NAÏVE BAYES CLASSIFICATION
Extracting Why Text Segment from Web Based on Grammar-gram
Presentation transcript:

Detecting Spam Web Pages through Content Analysis Alex Ntoulas1, Marc Najork2, Mark Manasse2, Dennis Fetterly2 1 UCLA Computer Science Department 2 Microsoft Research, Silicon Valley

Web Spam: raison d’etre E-commerce is rapidly growing Projected to $329 billion by 2010 More traffic  more money Large fraction of traffic from Search Engines Increase Search Engine referrals: Place ads  Provide genuinely better content  Create Web spam …  9 December 2018 WWW 2006

Web Spam (you know it when you see it) * Giving an exact definition at Web spam is not an easy task. Essentially we know web spam when we see it. Here are some examples. 9 December 2018 WWW 2006

Defining Web Spam Spam Web page: A page created for the sole purpose of attracting search engine referrals (to this page or some other “target” page) Ultimately a judgment call Some web pages are borderline cases * Here is a somewhat more formal definition 9 December 2018 WWW 2006

Why Web Spam is Bad Bad for users Bad for search engines Makes it harder to satisfy information need Leads to frustrating search experience Bad for search engines Wastes bandwidth, CPU cycles, storage space Pollutes corpus (infinite number of spam pages!) Distorts ranking of results 9 December 2018 WWW 2006

How pervasive is Web Spam? Dataset Real-Web data from the MSNBot crawler Collected during August 2004 Processed only MIME types text/html text/plain 105,484,446 Web pages in total 9 December 2018 WWW 2006

Spam per Top-level Domain 95% confidence 9 December 2018 WWW 2006

Spam per Language 95% confidence 9 December 2018 WWW 2006

Detecting Web Spam (1) We report results for English pages only (analysis is similar for other languages) ~55 million English Web pages in our collection Manual inspection of a random sample of the English pages Sample size: 17,168 Web pages 2,364 (13.8%) spam 14,804 (86.2%) non-spam 9 December 2018 WWW 2006

Detecting Web Spam (2) Spam detection: A classification problem Given salient features of a Web page, decide whether the page is spam Which “salient features”? Need to understand spamming techniques to decide on features Finding right features is “alchemy”, not science We focus on features that Are fast to compute Are local (i.e. examine a page in isolation) 9 December 2018 WWW 2006

Number of Words in <title> 9 December 2018 WWW 2006

Distribution of Word-counts in <title> Spam more likely in pages with more words in title 9 December 2018 WWW 2006

Visible Content of a Page 9 December 2018 WWW 2006

Visible Content of a Page 9 December 2018 WWW 2006

Distribution of Visible-content fractions Spam not likely in pages with little visible content 9 December 2018 WWW 2006

Compressibility of a Page 9 December 2018 WWW 2006

zipRatio of a page 9 December 2018 WWW 2006

Distribution of zipRatios Spam more likely in pages with high zipRatio 9 December 2018 WWW 2006

“Obscure” Content very rare words 9 December 2018 WWW 2006

Independence Likelihood Model Consider all n-grams wi+1…wi+n of a Web page, with k n-grams in total Probability that n-gram occurs in collection: Independence likelihood model Low IndepLH  high probability of page’s existence 9 December 2018 WWW 2006

Distribution of 3-gram Likelihoods (Independence) Pages with high & low likelihoods are spam Longer n-grams seem to help 9 December 2018 WWW 2006

Conditional Likelihood Model Consider all n-grams wi+1…wi+n of a Web page, with k n-grams in total Probability that n-gram wi+1…wi+n occurs, given that (n-1)-gram wi+1…wi+n-1 occurs Conditional likelihood model Low CondLH  high probability of page’s existence 9 December 2018 WWW 2006

Distribution of 3-gram Likelihoods (Conditional) Pages with high & low likelihoods are spam Longer n-grams seem to help 9 December 2018 WWW 2006

Spam Detection as a Classification Problem Use the previously presented metrics as features for a classifier Use the 17,168 manually tagged pages to build a classifier We show results for a decision-tree – C4.5 (other classifiers performed similarly) 9 December 2018 WWW 2006

Putting it all together size of the page static rank link depth number of dots/dashes/digits in hostname hostname length hostname domain number of words in the page number of words in the title fraction of anchor text average length of the words fraction of visible content fraction of top-100,200,500,1000 words in the text fraction of text in top-100,200,500,1000 words compression ratio (zipRatio) occurrence of the phrase “Privacy Policy” occurrence of the phrase “Privacy Statement” occurrence of the phrase “Customer Service” occurrence of the word “Disclaimer” occurrence of the word “Copyright” occurrence of the word “Fax” occurrence of the word “Phone” likelihood (independence) 1,2,3,4,5-grams likelihood (conditional probability) 2,3,4,5-grams 9 December 2018 WWW 2006

A Portion of the Induced Decision Tree 9 December 2018 WWW 2006

Evaluation of the Classifier We evaluated the classifier using 10-fold cross validation (and random splits) Training samples: 17,168 2,364 spam 14,804 non-spam Correctly classified: 16,642 (96.93 %) Incorrectly classified: 526 ( 3.07 %) 9 December 2018 WWW 2006

Confusion Matrix & Precision/Recall classified as  spam non-spam 1,940 424 366 14,440 class recall precision spam 82.1% 84.2% non-spam 97.5% 97.1% 9 December 2018 WWW 2006

Bagging & Boosting class recall precision spam 84.4% 91.2% non-spam 98.7% 97.5% After bagging class recall precision spam 86.2% 91.1% non-spam 98.7% 97.8% After boosting 9 December 2018 WWW 2006

Related Work Link spam Content spam Cloaking e-mail spam Z. Gyongyi, H. Garcia-Molina and J. Pedersen. Combating Web Spam with TrustRank. [VLDB 2004] B. Wu and B. Davison. Identifying Link Farm Spam Pages. [WWW 2005] B. Wu, V. Goel and B. Davison. Topical Trustrank: Using Topicality to Combat Web Spam. [WWW 2006] A.L. da Costa Carvalho, P.A. Chirita, E.S. de Moura, P. Calado, W. Nejdl. Site Level Noise Removal for Search Engines. [WWW 2006] R. Baeza-Yates, C. Castillo and V. Lopez. PageRank Increase under Different Collusion Topologies. [AIRWeb 2005] Content spam G. Mishne, D. Carmel and R. Lempel. Blocking Blog Spam with Language Model Disagreement.[AIRWeb 2005] D. Fetterly, M. Manasse and M. Najork. Spam, Damn Spam, and Statistics. [WebDB 2004] D. Fetterly, M. Manasse and M. Najork. Detecting Phrase-Level Duplication on the World Wide Web. [SIGIR 2005] Cloaking Z. Gyongyi and H. Garcia-Molina. Web Spam Taxonomy. [AIRWeb 2005] B. Wu and B. Davison. Cloaking and Redirection: a preliminary study.[AIRWeb 2005] e-mail spam J. Hidalgo. Evaluating cost-sensitive Unsolicited Bulk Email categorization. [SAC 2002] M. Sahami, S. Dumais, D. Heckerman and E. Horvitz. A Bayesian Approach to Filtering Junk E-Mail. [AAAI 1998] 9 December 2018 WWW 2006

Conclusion Studied properties of the spam pages (per top-level domain and language) Implemented and evaluated a variety of spam detection techniques Combined the techniques into a decision-tree classifier We can identify ~86% of all spam with ~91% accuracy 9 December 2018 WWW 2006

Thank you