Characterizing the Splogosphere: Tim Finin, Pranam Kolari, Akshay Java.

Characterizing the Splogosphere
Pranam Kolari, Akshay Java and Tim Finin
University of Maryland, Baltimore County
3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics
22 May 2006

Outline: Introduction; Motivation; BlogPulse Dataset; Dataset; Implications

The Blogosphere
57% of online US teens generate content, 40% read blogs, 20% have them! (Pew, Nov. 2005)
53% of companies are blogging (Guideware, Oct. 2005)
MySpace accounts for 1/3 of all web clicks (Hendler, 2006) ?!
But … the Blogosphere is awash in spam
Source: Wikipedia

Blogosphere/Splogosphere

Spam in the Blogosphere
Types: comment spam, ping spam, spam blogs
Akismet: "87% of all comments are spam"
75% of update pings are spam (ebiquity 2005)
20% of the blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005)
"Spam blogs, sometimes referred to by the neologism splogs, are weblog sites which the author uses only for promoting affiliated websites"
"Spings, or ping spam, are pings that are sent from spam blogs"
(Definitions from Wikipedia)

Motivation: host ads

Motivation: index affiliates, promote PageRank

Spings from

Where do Splogs come from?
"Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…"
"Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!"
"Holy Grail Of Advertising... "
"Easily Dominate Any Market, Any Search Engine, Any Keyword."
$197


Our splog bait was picked up and used by dozens of sploggers


Our feed is RSSjacked by at least one splogger

Why are splogs a problem?
Splogs undermine ranking algorithms
Splogs water down search results
Splogs threaten the Web advertising model
Splogs indulge in "plagiarism"
Splogs skew results of market research tools
Splogs stress the Blogosphere infrastructure of ping servers, blog search engines, etc.


Splog Detection
SVM based probabilistic splog detection (Kolari et al., 2006)
Hand verified training set of blogs and splogs
Precision/Recall of 87%
Bag-of-words based features using text on the blog home page, O(x)
Some additional local features
P( x is a splog | O(x) ), P( x is a blog | O(x) )
Top features (blogs and splogs): we what was my org flickr paper 600 open words weblog motion me thank go january trackback archives now political find info news your 27 another website best articles on perfect products uncategorized 280 hot resources inc 60 three copyright
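The bag-of-words observation O(x) can be illustrated with a minimal sketch. It is not the authors' feature extractor: the tokenizer, the binary encoding, and the six-word vocabulary (picked from the top-feature list on the slide) are all assumptions made for illustration. The resulting vector is the kind of input an SVM would be trained on.

```python
import re
from collections import Counter

# Hypothetical vocabulary: a handful of words from the slide's top-feature list.
VOCABULARY = ["trackback", "archives", "thank", "copyright", "products", "resources"]

def bag_of_words(text):
    """Count lowercase alphanumeric tokens in the home-page text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def feature_vector(text, vocabulary=VOCABULARY):
    """Binary O(x): 1 if the vocabulary word occurs on the page, else 0.

    Binary encoding is an assumption; counts or TF-IDF weights are
    equally plausible choices.
    """
    counts = bag_of_words(text)
    return [1 if word in counts else 0 for word in vocabulary]

page = "Thank you for reading! Archives | Trackback"
vector = feature_vector(page)
```

Word order is discarded, which keeps the feature cheap to compute from a single fetched page.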

This Work
By characterizing the splogosphere, we aim to achieve the following:
(i) Get a handle on the seriousness of the problem,
(ii) Develop new techniques for splog detection, and
(iii) Recommend placement of splog filters on the blogging infrastructure.
Characterization is based on comparing the nature of authentic blogs against splogs to identify discriminating features.


BlogPulse Dataset
21 days of July 2005
1.3 million blogs
Eliminated LiveJournal
Re-fetched blog home pages; many spam blogs no longer existed, since spam blogs are short-lived
Arrived at 500K samples
Set probability thresholds to 0.2 (authentic blog) and 0.8 (splog)
Identified 27K splogs
Sampled for 27K authentic blogs
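The two-threshold labeling can be sketched as follows. This is a minimal illustration, assuming both thresholds apply to the classifier's splog probability and that the bounds are inclusive; the slide states neither detail. The function name and the example hosts are hypothetical.

```python
def label_by_threshold(p_splog, blog_max=0.2, splog_min=0.8):
    """Label a sample from its splog probability.

    Assumption: scores at or below blog_max count as authentic blogs,
    scores at or above splog_min count as splogs, and everything in
    between is left unlabeled.
    """
    if p_splog <= blog_max:
        return "blog"
    if p_splog >= splog_min:
        return "splog"
    return "uncertain"

# Hypothetical classifier scores for three hosts.
scores = {"a.example": 0.05, "b.example": 0.93, "c.example": 0.50}
labels = {host: label_by_threshold(p) for host, p in scores.items()}
```

Leaving the middle band unlabeled trades coverage for cleaner samples on both sides.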

Splogs vs. Blogs – Word Count
[Figure panels: blogs; splogs; blogs and splogs]

Splogs vs. Blogs – In-degree
Top 5, first series: 1942, 905, 637, 616, 611
Top 5, second series: 505, 505, 505, 505, 505

Splogs vs. Blogs – Out-degree
Top 5, first series: 273, 271, 198, 193, 180
Top 5, second series: 898, 898, 898, 898, 898


Dataset
20 Nov 2005 – 11 Dec 2005
16 million update pings
Pings subdivided by language: da, de, en, es, fi, fr, it, nl, pt, sv
Heuristics to identify Japanese, Chinese, Korean
Set threshold of 0.5 to separate out authentic blogs from splogs
Thanks to James Mayfield, JHU APL

Ping times – Italian Blogs

Sping vs. Ping times

Spings vs. Pings: frequency
The number of blogs plotted against their ping frequency follows a power law, but the number of splogs plotted against their sping frequency does not
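One rough way to check a claim of this kind is to count how many blogs pinged exactly k times and fit a line to the counts on log-log axes. The sketch below assumes the ping log is simply a list of blog URLs, one entry per ping; the function names are hypothetical. A near-linear log-log plot is only a hint of a power law, not a proof.

```python
import math
from collections import Counter

def frequency_distribution(ping_urls):
    """Map ping count k -> number of blogs that pinged exactly k times."""
    pings_per_blog = Counter(ping_urls)
    return Counter(pings_per_blog.values())

def log_log_slope(distribution):
    """Least-squares slope of log(number of blogs) against log(ping count).

    Needs at least two distinct ping counts, otherwise the slope is undefined.
    """
    points = [(math.log(k), math.log(n)) for k, n in distribution.items()]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Synthetic distribution that follows n = 1000 * k^-2 exactly.
synthetic = {1: 1000, 2: 250, 5: 40, 10: 10}
slope = log_log_slope(synthetic)
```

Running the same fit separately on authentic blogs and on splogs would show whether one set departs from a straight line.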

All Pings – 16 Million
Close to 40% spings
Among English blogs
–75% of pings are spings
–Authentic blogs are 13% of all pings
Including Info domain
–50% of all pings are spings
url   count
http://www.wiccapaganblog.com   1491
http://www.myaquariumiplace.com   1375
http://www.criss-angel.biz   1215
http://www.microdermabrasion-   1211
http://www.countrymusicdigest.com   1191
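Summary figures like these, a sping share plus a table of the most frequent pingers, could be computed from a labeled ping log roughly as sketched here. The input format, a list of (url, is_sping) pairs, and the function name are assumptions; the example URLs are placeholders, not hosts from the dataset.

```python
from collections import Counter

def ping_summary(pings, top_n=5):
    """Return (share of pings labeled as spings, top_n URLs by ping count).

    pings: iterable of (url, is_sping) pairs, one per update ping.
    """
    pings = list(pings)
    sping_share = sum(1 for _, is_sping in pings if is_sping) / len(pings)
    top_urls = Counter(url for url, _ in pings).most_common(top_n)
    return sping_share, top_urls

# Tiny hypothetical log: three spings from one host, two pings from another.
log = [("http://a.example", True)] * 3 + [("http://b.example", False)] * 2
share, top = ping_summary(log)
```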


Implications (1)
BlogPulse dataset
–Local word models most effective for fast splog detection
–If splogs escape filters, in-link and out-link distributions point to link-based classification
Dataset
–Ping frequency can be useful
–Splogs probably not a big problem in most European languages. Yet.
The nature of the domain points to spam filters that employ a multi-step, adaptive approach, which we are currently pursuing

Implications (2) – Filter Design
[Diagram, stages numbered 1 to 4. Labels: Heuristics, Spam Blog Filter, Language Identifiers, Spam Blog Detectors, Blog Identifier, IP Blacklists, Supporting Info (optional). Outputs: Authentic Blogs, Spam Blogs]
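A multi-step filter built from the components named on this slide might look like the sketch below. The stage order, the ping fields (`ip`, `lang`, `is_blog`, `p_splog`), the supported-language set, and the 0.5 cutoff are all assumptions for illustration; the slide's diagram does not survive in enough detail to recover the actual design.

```python
BLACKLISTED_IPS = {"203.0.113.7"}    # hypothetical entry (documentation address range)
SUPPORTED_LANGUAGES = {"en", "it"}   # hypothetical: languages with a trained detector

def ip_blacklist(ping):
    return "spam blog" if ping.get("ip") in BLACKLISTED_IPS else None

def language_identifier(ping):
    return None if ping.get("lang") in SUPPORTED_LANGUAGES else "unsupported language"

def blog_identifier(ping):
    return None if ping.get("is_blog") else "not a blog"

def splog_detector(ping, threshold=0.5):
    return "spam blog" if ping.get("p_splog", 0.0) >= threshold else None

# Assumed order: cheap checks first, the content-based detector last.
STAGES = [
    ("ip_blacklist", ip_blacklist),
    ("language_identifier", language_identifier),
    ("blog_identifier", blog_identifier),
    ("splog_detector", splog_detector),
]

def run_filter(ping, stages=STAGES):
    """Return (verdict, deciding stage); a stage returning None passes the ping on."""
    for name, stage in stages:
        verdict = stage(ping)
        if verdict is not None:
            return verdict, name
    return "authentic blog", "default"

result = run_filter({"ip": "198.51.100.1", "lang": "en", "is_blog": True, "p_splog": 0.9})
```

Putting inexpensive stages first means most spings never reach the step that has to fetch and classify page content.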

Conclusions
Blog spam is a serious problem
–Classic arms race, e.g., increased plagiarism, feedjacking
Blog spam identification requires different tactics than used for email and Web spam
–Local features effective, but not sufficient
–Lots of relational features (e.g., links, ads, IP addresses, tight but disconnected communities) but dynamism reduces effectiveness of analysis
Getting good training sets is expensive, especially in a multilingual environment
–Minute or more a judgment
Good opportunities for infrastructure insertion, e.g., sping-free ping servers

Annotated in OWL For more information

Questions?

Blogs – A Specialized Domain
[Diagram, steps numbered 1 to 4. Labels: Update Pings, Ping Stream, Update Stream, Fetch Content]

