Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.


1 Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007

2 THESIS STATEMENT It is possible to develop an effective, efficient and adaptive system to detect spam blogs.

3 CONTRIBUTIONS (i) a principled study of the characteristics of the problem, (ii) a well-motivated feature discovery effort, (iii) a cost-sensitive, real-time filtering implementation, and (iv) an ensemble-driven classifier co-evolution.

4 Introduction Characterization Feature Discovery Cost-aware pipeline Adaptive Classifiers Evaluation Conclusions Future Directions OUTLINE

5 WHAT IS SPAM? “Unsolicited usually commercial e-mail sent to a large number of addresses” – Merriam-Webster Online. As the Internet has supported new applications, many other forms have become common, requiring a much broader definition: capturing user attention unjustifiably in Internet-enabled applications (e-mail, Web, social media, etc.).

6 DIRECT INDIRECT E-Mail Spam General Web Spam Spam Blogs (Splogs) SPAM TAXONOMY IM Spam (SPIM) Spamdexing INTERNET SPAM [Forms] [Mechanisms] Social Network Spam Comment Spam Bookmark Spam Social Media Spam

7 SPAMDEXING Spam pages, Spam Blogs, Spam Comments, Guestbook Spam, Wiki Spam SERP Search Engines Affiliate Programs Context Ads ads/affiliate links arbitrage in-links spamdex JavaScript Redirect Affiliate Program Buyers Spam pages, Spam Blogs [DOORWAY] Spammer owned domains (i) (ii) (iii)

8 SPAM BLOG Auto-generated and/or Plagiarized Content Advertisements in Profitable Contexts Link Farms to promote other spam pages

9 Introduction Characterization Feature Discovery Cost-aware pipeline Adaptive Classifiers Evaluation Conclusions Future Directions OUTLINE

10 CONTRIBUTIONS (i) a principled study of the characteristics of the problem, (ii) a well-motivated feature discovery effort, (iii) a cost-sensitive, real-time filtering implementation, and (iv) an ensemble-driven classifier co-evolution.

11 CHARACTERIZATION WordNet defines characterize as “to describe or portray the characters or the qualities or peculiarities” Our efforts –Define and Scope the Problem –Field Study –Principled Empirical Analysis –Publicize and solicit feedback

12 Update Pings Ping Stream 1 2 Fetch Content 3 Splog Filtering between steps 2 and 3 (Pre-indexing), used by blog harvester SCOPE

13 Bias of Search engines to blogs –through quick indexing (ping servers) –and higher relevance (temporal) Availability of third party blogging platforms –providing service for free –supporting programmatic content injection –enjoying high authority and trust (e.g. blogspot) –enabling obfuscation (doorways) to search engines and DMCA notices BLOGS & SPAMDEXING

14 56% of all active blogs are splogs! (2007) SPLOGS BY NUMBERS 75% of update pings (eBiquity 2006) 20% of indexed Blogosphere (Umbria 2006) 56% of update pings (eBiquity 2007)

15 Given a blog, is it authentic or spam? Explore evidence space –Contents of the Blogs (Local Attributes) –Evidence from Neighbors (Global Attributes) SPLOG DETECTION PROBLEM P(splog(x)/ O(x)) P(splog(x)/ L(x))

16 EXISTING CONTEXTS E-MAIL WEB BLOGS time time/posts Image Spam, Character Salad Scripts, Doorways Temporal Deception Users E-mail Service Provider Search Engines Page Hosting Services (e.g. Tripod) Web Search Engines Blog Search Engines Blog Hosting Services (Ping Servers) Fast Detection Low Overhead Online Batch Detection Mostly Offline Fast Detection Low Overhead NATURE WHO USES IT? CONSTRAINTS ATTACKS

17 RELATED WORK – WEB SPAM Local Content (Drost et al., 2005) –using TFIDF word features, specialized features, etc. Statistical Properties (Fetterly et al., 2004) –using page updates, identical pages through page-stitching TrustRank (Gyöngyi et al., 2004) –as an extension to PageRank Splog Detection (Salvetti et al.; Lin et al.)

18 Introduction Characterization Feature Discovery Cost-aware pipeline Adaptive Classifiers Evaluation Conclusions Future Directions OUTLINE

19 CONTRIBUTIONS (i) a principled study of the characteristics of the problem, (ii) a well-motivated feature discovery effort, (iii) a cost-sensitive, real-time filtering implementation, and (iv) an ensemble-driven classifier co-evolution.

20 MACHINE LEARNING CLASSIFICATION Documents as vectors in a feature space $(f_1, f_2, f_3, \ldots, f_m)$ Feature Space –Discovery –Representation –Selection Classification Techniques –Support Vector Machines (Discriminative) –Naïve Bayes Classifier (Generative) Tools (libsvm, weka)

21 Precision (P) –a measure of correctness of classified documents Recall (R) –a measure of completeness of classified documents F-1 = 2*P*R/(P+R) ROC AUC * – Area Under the Curve –a measure of discriminatory power MACHINE LEARNING EVALUATION * Presented in Thesis Document
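The precision, recall and F-1 definitions on this slide can be made concrete with a small sketch. The function name `precision_recall_f1` and the convention that label 1 means "splog" are assumptions for illustration; the slide only gives the formula F-1 = 2*P*R/(P+R).

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F-1) for the positive class (assumed: 1 = splog)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Precision: of everything flagged as splog, how much really was splog.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all real splogs, how many were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-1 = 2*P*R/(P+R), the harmonic mean of the two.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Because F-1 is a harmonic mean, a classifier with high precision but low recall (as reported later for relational features) still scores poorly on it.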

22 SPLOG-2005 –Sampled Summer 2005 at Technorati –Labeled samples of 700 blogs and 700 splogs –Only Blog-homepages SPLOG-2006 –Sampled Oct 2006 at Weblogs.com –Labeled samples of 750 blogs and 750 splogs –Blog-homepages + feeds DATASETS

23 EXPERIMENTAL SETUP Binary feature encoding Top 50K selected using frequency count SVMs –Default parameters –Linear Kernel No stemming or stop word elimination Naïve Bayes Ten fold cross-validation
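A minimal sketch of binary feature encoding with frequency-based selection follows. It assumes "frequency count" means the number of documents containing a feature; the slide does not say which count was used. The names `select_top_features` and `binary_encode` are hypothetical, and the resulting vectors would then be handed to a learner such as a linear SVM.

```python
from collections import Counter


def select_top_features(docs, k):
    """Map the k most frequent features to column indices.

    docs is a list of token lists. Frequency here is document frequency
    (an assumption); ties are broken alphabetically for determinism.
    """
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {feature: idx for idx, (feature, _) in enumerate(ranked[:k])}


def binary_encode(tokens, vocab):
    """Encode a document as a 0/1 vector: 1 if the feature occurs at all."""
    vector = [0] * len(vocab)
    for token in set(tokens):
        idx = vocab.get(token)
        if idx is not None:
            vector[idx] = 1
    return vector
```

With binary encoding, repeating a keyword many times on a page does not change its vector, which is one plausible reason to prefer it over raw counts in a spam setting.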

24 URL [Figure: panels for 2005 and 2006]

25 URL [Figure: panels for 2005 and 2006] 3-, 4- and 5-charactergrams from the URL. Captures profitable contexts. Highly effective at ping streams. Supports an extremely low-cost classifier.
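As a rough illustration of the URL feature space on this slide, character n-grams of length 3, 4 and 5 can be extracted as below. The function name `char_ngrams`, the lowercasing step and the use of a set (matching binary encoding) are assumptions for illustration, not details taken from the thesis.

```python
def char_ngrams(text, sizes=(3, 4, 5)):
    """Return the set of character n-grams of the given sizes found in text."""
    text = text.lower()
    grams = set()
    for n in sizes:
        # Slide a window of width n across the string.
        for start in range(len(text) - n + 1):
            grams.add(text[start:start + n])
    return grams
```

Since only the URL string is needed, such a classifier can run directly on the ping stream without fetching any page content, which is what makes it cheap.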

26 WORDS [Figure: panels for 2005 and 2006]

27 WORDS [Figure: panels for 2005 and 2006] Words (text) on a blog. Previously effective in topic classification. Captures profitable advertising contexts. Interesting authentic genre observed.

28 WORDGRAMS [Figure: panels for 2005 and 2006]

29 WORDGRAMS [Figure: panels for 2005 and 2006] Word-2-grams: 2 adjacent words. Shallow NLP technique to tackle word salad. Word salad less common in web spam (TFIDF). Word-x-gram features grow exponentially with x.
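Word-2-grams as described here are pairs of adjacent words. The sketch below uses whitespace tokenization and lowercasing, both of which are assumptions; the function name `word_ngrams` is hypothetical.

```python
def word_ngrams(text, n=2):
    """Return the list of word n-grams (n adjacent words) in text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

The exponential growth noted on the slide follows from counting: a vocabulary of V words allows up to V**x distinct word-x-grams, so feature selection becomes more important as x increases.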

30 CHARACTERGRAMS [Figure: panels for 2005 and 2006]

31 CHARACTERGRAMS [Figure: panels for 2005 and 2006] 3-, 4- and 5-charactergrams from blog content. Can capture character salad (e.g. p1lls). Feature selection important.

32 OUTLINKS [Figure: panels for 2005 and 2006]

33 OUTLINKS [Figure: panels for 2005 and 2006] Out-links tokenized by non-alphabetic characters. Similar to URL n-grams, likely more robust. Novel feature space.
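Tokenizing an out-link on non-alphabetic characters could be sketched as follows. The function name `tokenize_outlink`, the lowercasing step and the example URL are illustrative assumptions.

```python
import re


def tokenize_outlink(url):
    """Split a URL into alphabetic tokens, dropping digits and punctuation."""
    return [token for token in re.split(r"[^a-z]+", url.lower()) if token]
```

For example, `tokenize_outlink("http://www.example.com/cheap-pills2buy")` yields the tokens `http`, `www`, `example`, `com`, `cheap`, `pills` and `buy`. Compared with fixed-width character n-grams, these tokens align with the word boundaries a spammer places in the link.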

34 ANCHORS [Figure: panels for 2005 and 2006]

35 ANCHORS [Figure: panels for 2005 and 2006] Anchor text tokenized into words. Subsumed by words, but obfuscation difficult. Captures personalization of publishing template. Novel feature space.
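One way to obtain anchor-text words is to collect the text inside `<a>` elements and split it into tokens. This sketch uses Python's standard `html.parser`; the class name `AnchorTextCollector`, the function `anchor_words` and the alphanumeric tokenization rule are assumptions, since the slide does not describe the extraction procedure.

```python
import re
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect text that appears inside <a>...</a> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # how many <a> elements are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.chunks.append(data)


def anchor_words(html):
    """Return lowercased word tokens found in anchor text."""
    collector = AnchorTextCollector()
    collector.feed(html)
    collector.close()
    return re.findall(r"[a-z0-9]+", " ".join(collector.chunks).lower())
```

Text outside links is ignored, so the resulting features are a subset of the plain word features, consistent with the "subsumed by words" remark on the slide.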

36 “Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…” “Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!” “Holy Grail Of Advertising...” “Easily Dominate Any Market, Any Search Engine, Any Keyword.” Splog software ?! $ 197

37 Capture HTML Stylistic Patterns in Authentic Blogs

38 HTMLTAGS [Figure: panels for 2005 and 2006]

39 HTMLTAGS [Figure: panels for 2005 and 2006] Use HTML tags – stylistic information. Captures signatures of splog software. Fully language independent. Novel feature space.
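A minimal sketch of an HTML-tag feature extractor is shown below, again using the standard `html.parser`. Treating each distinct tag name as one binary feature is an assumption made for illustration; the slide does not specify whether tag names, sequences or attributes were used. The names `TagCollector` and `html_tag_features` are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every opening tag, ignoring all text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names before calling this method.
        self.tags.append(tag)


def html_tag_features(html):
    """Return the set of tag names used in a page."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return set(collector.tags)
```

Because the page text is discarded entirely, the features do not depend on the language the blog is written in, which matches the "fully language independent" claim.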

40 FEED BASED DETECTION Limitations using only home-pages –No knowledge of blog lifetime –Classifiers less effective in early lifecycle Benefits of using feeds –Most recent posts, lifetime, metadata –Capture correlations across posts Limitations of using only feeds –Lose out on signatures in publishing template

41 FEED ITEM DISTRIBUTION Plot number of items in feeds (SPLOG-2006) Authentic Blogs feature normal distribution Splogs – many with just one post Knowledge of classifier effectiveness vs. lifetime

42 FEED BASED DETECTION Disjoint feature spaces – Words, Tags Trained and Tested with n (x-axis) posts Publishing template signatures important Tags much more effective – early lifecycle

43 RELATED CLASSIFIERS Blog Identification –Competency requirement for blog harvesters –F-1 measure of 98% Relational Features –Less Effective (High P, Low R) –Short-lived blogs, lifetime dependent –Knowledge of Web-graph Derived Features –Less Effective

44 FEATURE SPACE OBSERVATIONS Cost based classifier bucketing Known Feature Spaces –Words continue to be effective –Word-grams against obfuscation Novel Feature Spaces –Out-links, Anchors capture useful signals –HTML Tags very effective, even early lifecycle Feature Space Exploration –Tags, JavaScript, Feed Classification

45 Introduction Characterization Feature Discovery Cost-aware pipeline Adaptive Classifiers Evaluation Conclusions Future Directions OUTLINE

46 CONTRIBUTIONS (i) a principled study of the characteristics of the problem, (ii) a well-motivated feature discovery effort, (iii) a cost-sensitive, real-time filtering implementation, and (iv) an ensemble-driven classifier co-evolution.

47 META-PING SYSTEM Regular Expression Filtering (March 2005) List of Authentic Blogs (August 2005) Blog Home-page Classifier (December 2005) URL Classifier (October 2006) Feed Classifier (May 2007) Cost-Aware Pipeline Implementation (Jan 2007)

48 META-PING SYSTEM [Diagram labels: Ping Stream; Ping Log; Pre-indexing Sping Filter; Regular Expressions; Blacklists/Whitelists; URL Filters; Homepage Filters; Feed Filters; Blog Identifier; Language Identifier; Authentic Blogs; IP Blacklists; arrow labeled “Increasing Cost”]

49 META-PING SYSTEM Static Design –Project specific thresholds –Classifiers in pipeline –Based on accrued domain knowledge Dynamic Possibilities –Classifier Thresholds –Classifier use –Queuing analysis and Precision/Recall requirements
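The idea of ordering filters by increasing cost and applying per-classifier thresholds can be sketched as a simple cascade. Everything concrete here is hypothetical: the function `run_pipeline`, the two toy scoring functions, the keyword list and the threshold values are placeholders, not the deployed system's filters or settings.

```python
import re


def run_pipeline(url, stages):
    """Pass a pinged URL through stages ordered cheapest first.

    Each stage is (name, score_fn, low, high), where score_fn returns an
    estimated probability that the URL is a splog. A stage decides only
    when it is confident; otherwise the next, costlier stage is tried.
    """
    for name, score_fn, low, high in stages:
        p = score_fn(url)
        if p >= high:
            return "splog", name
        if p <= low:
            return "authentic", name
    return "undecided", None


def regex_filter(url):
    # Toy stand-in for the regular expression stage (hypothetical keywords).
    return 1.0 if re.search(r"casino|pills", url) else 0.5


def url_classifier(url):
    # Toy stand-in for a learned URL classifier (hypothetical rule).
    return 0.95 if url.count("-") >= 3 else 0.05


STAGES = [
    ("regex", regex_filter, 0.0, 0.99),
    ("url", url_classifier, 0.1, 0.9),
]
```

Costlier stages such as homepage or feed classifiers, which require fetching content, would be appended to `STAGES` and reached only for URLs the cheap stages cannot settle. Widening or narrowing the `low`/`high` band per stage is one way to picture the dynamic threshold possibilities listed on this slide.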

50 Introduction Characterization Feature Discovery Cost-aware pipeline Adaptive Classifiers Evaluation Conclusions Future Directions OUTLINE

51 CONTRIBUTIONS (i) a principled study of the characteristics of the problem, (ii) a well-motivated feature discovery effort, (iii) a cost-sensitive, real-time filtering implementation, and (iv) an ensemble-driven classifier co-evolution.

52 ADAPTIVE CONTEXT Change in distribution in feature space $(f_1, f_2, f_3, \ldots, f_m)$ Concept Drift – Seasonal, seen in both splogs and blogs Adversarial Scenario – seen in splogs Concept Description needs to be updated P(O(x)/splog(x)) P(splog(x)/O(x))

53 ENSEMBLE INTUITION Stream of unlabeled instances (drifting) Base classifiers with potentially independent feature spaces Is an ensemble (probabilistic committee) of the catalogue more robust to drift? Are instances classified by the ensemble effective to retrain base classifiers (semi-supervised learning)? Motivated by co-training

54 ENSEMBLE INTUITION [Diagram labels: unlabeled instances; classify; ensemble committee (probabilistic); retrain; base classifiers; updated classifiers]

55 ENSEMBLE APPROACH ensemble committee probabilistic base classifiers
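The classify-then-retrain loop from the ensemble intuition slides can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the committee is taken to be a plain average of base classifier probabilities, only instances the committee is confident about are used for retraining, and retraining happens once after the whole stream. `MeanThresholdClassifier` is a toy stand-in for a real base classifier (the deck uses SVMs), and all names and the 0.8 confidence value are hypothetical.

```python
class MeanThresholdClassifier:
    """Toy base classifier that looks at a single feature (its 'view')."""

    def __init__(self, view):
        self.view = view
        self.threshold = 0.5

    def fit(self, instances, labels):
        pos = [x[self.view] for x, y in zip(instances, labels) if y == 1]
        neg = [x[self.view] for x, y in zip(instances, labels) if y == 0]
        if pos and neg:
            # Place the threshold midway between the two class means.
            self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

    def predict_proba(self, x):
        # Crude splog probability: distance from the threshold, clipped to [0, 1].
        return min(1.0, max(0.0, 0.5 + (x[self.view] - self.threshold)))


def committee_proba(classifiers, x):
    """Probabilistic committee: average of the base classifier estimates."""
    return sum(c.predict_proba(x) for c in classifiers) / len(classifiers)


def adapt(classifiers, labeled, stream, confidence=0.8):
    """Train on labeled data, label the stream by committee, then retrain.

    Returns the number of stream instances added to the training data.
    """
    instances = [x for x, _ in labeled]
    labels = [y for _, y in labeled]
    for c in classifiers:
        c.fit(instances, labels)

    for x in stream:
        p = committee_proba(classifiers, x)
        if p >= confidence:
            instances.append(x)
            labels.append(1)
        elif p <= 1 - confidence:
            instances.append(x)
            labels.append(0)
        # Instances the committee is unsure about are skipped.

    for c in classifiers:
        c.fit(instances, labels)
    return len(instances) - len(labeled)
```

The link to co-training is that each base classifier is retrained on labels it did not produce alone: a classifier whose feature space has drifted can be corrected by the others, provided their errors are not strongly correlated.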

56 POTENTIAL TO ADAPT URL Anchor Chargram Outlink Tag Wordgrams Words

57 EXPERIMENTAL SETUP A catalog of seven classifiers SPLOG-2005 as base labeled dataset SPLOG-2006 as evaluation stream 10K Top Features SVM based learning SPLOG-2006 separated out into unlabeled stream and test set (3-fold) F-1 performance metric evaluation

58 RESULTS – WORD DRIFT

59 RESULTS – ALL CLASSIFIERS

60 ENSEMBLE PROPERTIES Effectiveness tied to properties of ensemble Precision 92%, Recall 93% – 5 points over best base classifier Ensemble Diversity –a measure of disagreement between base classifiers –maintain error of base classifiers

61 ENSEMBLE PROPERTIES Different metrics for diversity. The Q-statistic compares pairs of classifiers in an ensemble and ranges over [-1, +1]: -1 is most diverse, +1 is least diverse. $$Q = \frac{N_{11} N_{00} - N_{01} N_{10}}{N_{11} N_{00} + N_{01} N_{10}}$$ where $N_{11}$ = both classify correctly, $N_{00}$ = both misclassify, $N_{10}$ = misclassification by the 2nd classifier, $N_{01}$ = misclassification by the 1st classifier.
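The Q-statistic on this slide can be computed directly from per-instance correctness of two classifiers. The function name `q_statistic` and the input format (two equal-length sequences of booleans, True meaning "classified correctly") are assumptions for illustration.

```python
def q_statistic(correct_first, correct_second):
    """Q-statistic for a pair of classifiers.

    N11: both correct, N00: both wrong,
    N10: only the 2nd misclassifies, N01: only the 1st misclassifies.
    """
    n11 = n00 = n10 = n01 = 0
    for first_ok, second_ok in zip(correct_first, correct_second):
        if first_ok and second_ok:
            n11 += 1
        elif not first_ok and not second_ok:
            n00 += 1
        elif first_ok:
            n10 += 1
        else:
            n01 += 1

    denominator = n11 * n00 + n01 * n10
    if denominator == 0:
        raise ValueError("Q is undefined when N11*N00 + N01*N10 is zero")
    return (n11 * n00 - n01 * n10) / denominator
```

Two classifiers that always succeed and fail on the same instances give Q = +1, while two that never succeed or fail together give Q = -1, matching the "least diverse" and "most diverse" ends of the range.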

62 ENSEMBLE PROPERTIES Pairwise values between base classifiers:
          cgram   wgram    word     tag  outlink  anchor     url
cgram       1.0    0.67    0.86   -0.23     0.35    0.58    0.08
wgram      0.67     1.0    0.77   -0.08     0.62    0.56    0.11
word       0.86    0.77     1.0   -0.19     0.53    0.76    0.04
tag       -0.23   -0.08   -0.19     1.0     0.15   -0.12    0.24
outlink    0.35    0.62    0.53    0.15      1.0    0.45    0.10
anchor     0.58    0.56    0.76   -0.12     0.45     1.0    0.03
url        0.08    0.11    0.04    0.24     0.10    0.03     1.0

63 ADAPTIVE - OBSERVATIONS Interplay between co-training and adversarial classification Maintaining and exploiting (ensemble) a catalogue of features effective Unlike existing work in concept drift –Real-world data –Stream of unlabeled instances Novel “feature spaces” key to adaptive, adversary resilient classifiers

64 Introduction Characterization Feature Discovery Adaptive Classifiers Cost-aware pipeline Evaluation Conclusions Future Directions OUTLINE

65 THESIS STATEMENT It is possible to develop an effective, efficient and adaptive system to detect spam blogs.

66 META-PING SYSTEM [Diagram labels: Ping Stream; Ping Log; Pre-indexing Sping Filter; Regular Expressions; Blacklists/Whitelists; URL Filters; Homepage Filters; Feed Filters; Blog Identifier; Language Identifier; Authentic Blogs; IP Blacklists; arrow labeled “Increasing Cost”]

67 META-PING IMPLEMENTATION Multithreaded, distributed Java implementation Regular Expressions, accrued over two years, tested using white-lists Blacklists - IP Address from known domain, learnt using higher cost classifiers libsvm toolkit for probabilistic classifiers Project specific classifier choices and thresholds

68 META-PING EVALUATION Effective sub-modules –Evaluation of effective features –Harvard (Blog Identification, Word-based classifier), UMich (shared results) –********, *******, LMCO Efficient solution –Pipeline deployment at UMBC (January 2007) –Ping filtering for two months (3 machines – 40 threads) Adaptive ready (offline) –Evaluation using year apart real-world datasets

69 Introduction Characterization Feature Discovery Adaptive Classifiers Cost-aware pipeline Evaluation Conclusions Future Directions OUTLINE

70 THESIS STATEMENT It is possible to develop an effective, efficient and adaptive system to detect spam blogs.

71 CONCLUSIONS Characterizing the Problem of Spam Blogs –Helps Drive Solutions –Readies tackling new emerging problems (e.g. Social Media spam) Feature Spaces effective for text classification are also useful here New feature spaces are quite effective, and could potentially be useful in other domains

72 CONCLUSIONS Using classifier costs to drive a pipeline based implementation can lead to an efficient filtering solution Semi-supervised ensemble approach can enable adaptive classifiers –Could be useful in domains (adversarial) that use a catalogue of classifiers –Proactive techniques are feasible for web spam detection

73 Introduction Characterization Feature Discovery Adaptive Classifiers Cost-aware pipeline Evaluation Conclusions Future Directions OUTLINE

74 DIRECT INDIRECT E-Mail Spam General Web Spam Spam Blogs (Splogs) SOCIAL MEDIA SPAM IM Spam (SPIM) Spamdexing INTERNET SPAM [Forms] [Mechanisms] Social Network Spam Comment Spam Bookmark Spam Social Media Spam

75 SOCIAL MEDIA SPAM Spam in social “microcosms” on the Web Spam on the Web –Spamdexing –Social Media Spam Social Media Spam serves two purposes –Local effects initially –Global effects subsequently (spamdexing) Detection efforts should address deployment contexts (microcosm, search)

76 OPEN PROBLEMS Feature Sophistication in new feature spaces, HTML Tags, JavaScript, Feeds Cost-aware pipeline (dynamic) Adversarial Classification, interplay with concept drift, catalog of features Active Learning and Adversarial Classification in the “catalogue” context Social Media Spam

77 THANK YOU!

