Comment Spam Identification Eric Cheng & Eric Steinlauf

What is comment spam?

Total spam: 1,226,026,178
Total ham: 62,723,306
95% are spam!
Source: http://akismet.com/stats/, retrieved 4/22/2007

Countermeasures

Blacklisting 5yx.org 9kx.com aakl.com aaql.com aazl.com abcwaynet.com abgv.com abjg.com ablazeglass.com abseilextreme.net actionbenevole.com acvt.com adbx.com adhouseaz.com advantechmicro.com aeur.com aeza.com agentcom.com ailh.org akbu.com alaskafibre.com alkm.com alqx.com alumcasting-eng-inc.co! americanasb.com amwayau.com amwaynz.com amwaysa.com amysudderjoy.com anfb.com anlusa.net aobr.com aoeb.com apoctech.com apqf.com areagent.com artstonehalloweencostumes.com globalplasticscrap.com gowest-veritas.com greenlightgo.org hadjimitsis.com healthcarefx.com herctrade.com hobbyhighway.com hominginc.com hongkongdivas.com hpspyacademy.com hzlr.com idlemindsonline.com internetmarketingserve.com jesh.org jfcp.com jfss.com jittersjapan.com jkjf.com jkmrw.com jknr.com jksp.com jkys.com jtjk.com justfareed.com justyourbag.com kimsanghee.org kiosksusa.com knivesnstuff.com knoxvillevideo.com ksj! kwscuolashop.com lancashiremcs.com lnjk.com localmediaaccess.com lrgww.com marketing-in-china.com rockymountainair.org rstechresources.com samsung-integer.com sandiegonhs.org screwpile.org scvend.org sell-in-china.com sensationalwraps.com sevierdesign.com starbikeshop.com struthersinc.com swarangeet.com thecorporategroup.net thehawleyco.com thehumancrystal.com thinkaids.org thisandthatgiftshop.net thomsungroup.com ti0.org timeby.net tradewindswf.com tradingb2c.com turkeycogroup.net vassagospalace.com vyoung.net web-toggery.com webedgewars.com webshoponsalead.com webtoggery.com willman-paris.com worldwidegoans.com

Captchas "Completely Automated Public Turing test to tell Computers and Humans Apart"

Other ad-hoc/weak methods
Authentication / registration
Comment throttling
Disallowing links in comments
Moderation

Our Approach – Naïve Bayes
Statistical
Adaptive
Automatic
Scalable and extensible
Works well for email spam

Naïve Bayes

P(A|B) ∙ P(B) = P(B|A) ∙ P(A) = P(AB)

P(A|B) ∙ P(B) = P(B|A) ∙ P(A)

P(A|B) = P(B|A) ∙ P(A) / P(B)

P(spam|comment) = P(comment|spam) ∙ P(spam) / P(comment)

P(spam|comment) = P(w_1|spam) ∙ P(w_2|spam) ∙ … ∙ P(w_n|spam) ∙ P(spam) / P(comment)   (naïve independence assumption)
P(w_1|spam): the probability of w_1 occurring given a spam comment

P(w_1|spam) = 1 – (1 – x/y)^n
where x is the number of times w_1 appears in all spam messages, y is the total number of words in all spam messages, and n is the length of the given comment.
Worked example: spam corpus (5 words, "Texas" appearing twice): Texas casino online Texas hold'em. Incoming comment (3 words): Texas gambling site.
P(Texas|spam) = 1 – (1 – 2/5)^3 = 0.784
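
A minimal Python sketch of this estimate, using the worked example above:

# P(w|spam) = 1 - (1 - x/y)^n for the example corpus and comment
spamWords = "Texas casino online Texas hold'em".lower().split()
comment = "Texas gambling site".lower().split()

x = spamWords.count("texas")   # occurrences of the word in all spam (2)
y = len(spamWords)             # total words in all spam (5)
n = len(comment)               # length of the incoming comment (3)

print(1 - (1 - x / y) ** n)    # ≈ 0.784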

P(spam|comment) = P(w_1|spam) ∙ P(w_2|spam) ∙ … ∙ P(w_n|spam) ∙ P(spam) / P(comment)
P(w_1|spam): the probability of w_1 occurring given a spam comment

P(spam|comment) = P(w_1|spam) ∙ P(w_2|spam) ∙ … ∙ P(w_n|spam) ∙ P(spam) / P(comment)
P(w_1|spam): the probability of w_1 occurring given a spam comment
P(spam): the probability of something being spam

P(spam|comment) = P(w_1|spam) ∙ P(w_2|spam) ∙ … ∙ P(w_n|spam) ∙ P(spam) / P(comment)
P(w_1|spam): the probability of w_1 occurring given a spam comment
P(spam): the probability of something being spam
P(comment): ??????

P(spam|comment) = P(w_1|spam) ∙ P(w_2|spam) ∙ … ∙ P(w_n|spam) ∙ P(spam) / P(comment)
P(w_1|spam): the probability of w_1 occurring given a spam comment
P(spam): the probability of something being spam
P(comment): ??????
P(ham|comment) = P(w_1|ham) ∙ P(w_2|ham) ∙ … ∙ P(w_n|ham) ∙ P(ham) / P(comment)

P(spam|comment) ∝ P(w_1|spam) ∙ P(w_2|spam) ∙ … ∙ P(w_n|spam) ∙ P(spam)
P(ham|comment) ∝ P(w_1|ham) ∙ P(w_2|ham) ∙ … ∙ P(w_n|ham) ∙ P(ham)
P(comment) is the same in both denominators, so it can be dropped when the two are compared.

Take the log of both sides:
log(P(spam|comment)) ∝ log( P(w_1|spam) ∙ P(w_2|spam) ∙ … ∙ P(w_n|spam) ∙ P(spam) )
log(P(ham|comment)) ∝ log( P(w_1|ham) ∙ P(w_2|ham) ∙ … ∙ P(w_n|ham) ∙ P(ham) )

log(P(spam|comment)) ∝ log(P(w_1|spam)) + log(P(w_2|spam)) + … + log(P(w_n|spam)) + log(P(spam))
log(P(ham|comment)) ∝ log(P(w_1|ham)) + log(P(w_2|ham)) + … + log(P(w_n|ham)) + log(P(ham))

Fact: P(spam|comment) = 1 – P(ham|comment)
Abuse of notation: P(s) = P(spam|comment), P(h) = P(ham|comment)

P(s) = 1 – P(h)
m = log(P(s)) – log(P(h)) = log(P(s)/P(h))
e^m = e^(log(P(s)/P(h))) = P(s)/P(h)
e^m ∙ P(h) = P(s)

P(s) = 1 – P(h)
m = log(P(s)) – log(P(h))
e^m ∙ P(h) = P(s)
e^m ∙ P(h) = 1 – P(h)
(e^m + 1) ∙ P(h) = 1
P(h) = 1/(e^m + 1)
P(s) = 1 – P(h)
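
A quick numeric check of this algebra in Python (the starting value of P(s) is arbitrary):

import math

pS = 0.9
pH = 1 - pS
m = math.log(pS) - math.log(pH)                  # log-odds
assert abs(1 / (math.exp(m) + 1) - pH) < 1e-12   # matches P(h) = 1/(e^m + 1)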

m = log(P(s)) – log(P(h))
P(h) = 1/(e^m + 1)
P(s) = 1 – P(h)

m = log(P(spam|comment)) – log(P(ham|comment))
P(ham|comment) = 1/(e^m + 1)
P(spam|comment) = 1 – P(ham|comment)

In practice, just compare log(P(spam|comment)) and log(P(ham|comment)).
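
A hedged Python sketch of this comparison in log space; the probability-table and function names here are illustrative, not from the slides:

import math

def logScore(words, wordProbs, prior):
    # Log of the class prior plus the sum of log word probabilities.
    # Assumes every word has an estimate; unseen words would need smoothing.
    return math.log(prior) + sum(math.log(wordProbs[w]) for w in words)

def classify(words, spamProbs, hamProbs, pSpam, pHam):
    logS = logScore(words, spamProbs, pSpam)
    logH = logScore(words, hamProbs, pHam)
    m = logS - logH                  # m = log(P(s)) - log(P(h)); P(comment) cancels
    pS = 1 - 1 / (math.exp(m) + 1)   # recover P(spam|comment) if a probability is wanted
    return ("spam" if logS > logH else "ham", pS)

spamProbs = {"casino": 0.8, "hello": 0.1}
hamProbs = {"casino": 0.05, "hello": 0.6}
classify(["casino", "hello"], spamProbs, hamProbs, pSpam=0.67, pHam=0.33)
# -> ('spam', ≈0.84)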

Implementation

Corpus
A collection of 50 blog pages with 1024 comments, manually tagged as spam/non-spam; 67% are spam.
Provided by the Informatics Institute at the University of Amsterdam.
G. Mishne, D. Carmel, and R. Lempel. Blocking Blog Spam with Language Model Disagreement. In: AIRWeb '05, First International Workshop on Adversarial Information Retrieval on the Web, at the 14th International World Wide Web Conference (WWW2005), 2005.

Most popular spam words: casino, betting, texas, biz, holdem, poker, pills, pokerabc, teen, online, bowl, gambling, sonneries, blackjack, pharmacy

"Clean" words: edu, projects, week, etc, went, inbox, bit, someone, bike, already, selling, making, squad, left, important, pimps

Implementation
Corpus parsing and processing
Naïve Bayes algorithm
Randomly select 70% of comments for training, 30% for testing (see the sketch below)
Stand-alone web service
Written entirely in Python
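
A minimal sketch of the random 70/30 split; the function name and the (text, isSpam) data layout are our assumptions:

import random

def splitCorpus(comments, trainFraction=0.7, seed=0):
    # comments: list of (text, isSpam) pairs
    shuffled = list(comments)
    random.Random(seed).shuffle(shuffled)     # deterministic shuffle for repeatability
    cut = int(len(shuffled) * trainFraction)
    return shuffled[:cut], shuffled[cut:]     # (training set, test set)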

It’s showtime!

Configurations
Separator used to tokenize comments
Inclusion of words from the header
Classify based only on the most significant words
Double-count non-spam comments
Include the article body as a non-spam example
Boosting

Minimum-Error Configuration
Separator: [^a-z<>]+
Header: both
Significant words: all
Double count: no
Include body: no
Boosting: no
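
Since the winning separator is a regular expression, tokenization reduces to a regex split; a sketch, assuming comments are lowercased first:

import re

SEPARATOR = re.compile(r"[^a-z<>]+")  # the minimum-error separator above

def tokenize(comment):
    # Split on any run of characters other than a-z, '<' and '>'.
    return [t for t in SEPARATOR.split(comment.lower()) if t]

tokenize("Visit <a href='http://example.com'>Texas hold'em!</a>")
# ['visit', '<a', 'href', 'http', 'example', 'com', '>texas', 'hold', 'em', '</a>']

Keeping '<' and '>' inside tokens means HTML fragments such as '<a' survive as their own tokens, which is plausibly a strong spam signal in link-heavy comments.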

Varying Configuration Parameters

Boosting
Naïve Bayes is applied repeatedly to the data, producing a weighted majority model.

bayesModels = []
weights = [1.0] * len(examples)              # start with uniform example weights
for i in range(M):
    model = naiveBayes(examples, weights)    # train on the current weighting
    error = computeError(model, examples)    # weighted training error
    weights = adjustWeights(examples, weights, error)  # up-weight misclassified examples
    bayesModels.append((model, error))       # keep the error to weight this model's vote
    if error == 0:
        break
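
The slides keep each model's error, suggesting error-based vote weights, but do not spell out the scheme. A sketch of the weighted-majority prediction under an assumed AdaBoost-style weighting, log((1 - error)/error); classify() is a hypothetical helper returning a model's label:

import math

def predict(words, bayesModels):
    # Weighted majority vote: each boosted model votes +1 (spam) or -1 (ham).
    total = 0.0
    for model, error in bayesModels:
        alpha = math.log((1 - error) / max(error, 1e-9))  # accurate models vote louder
        vote = 1 if classify(model, words) == "spam" else -1  # classify() is hypothetical
        total += alpha * vote
    return "spam" if total > 0 else "ham"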

Boosting

Future work (or what we did not do)

Data Processing
Follow links in comments and include words from the target web page
More sophisticated tokenization and URL handling (e.g. handling $100,000...)
Word stemming

Features
Ability to incorporate incoming comments into the corpus
Ability to mark a comment as spam/non-spam
Assign more weight to page content
Adjust the probability table based on page content, providing content-sensitive filtering

Comments? No spam, please.