UMBC an Honors University in Maryland Characterizing the Splogosphere Tim Finin Pranam Kolari, Akshay Java.

1 UMBC an Honors University in Maryland
Characterizing the Splogosphere
Tim Finin
http://ebiquity.umbc.edu/paper/html/id/299/
Pranam Kolari, Akshay Java and Tim Finin
University of Maryland, Baltimore County
3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics
22 May 2006

2 Outline: Introduction, Motivation, BlogPulse Dataset, Weblogs.com Dataset, Implications

3 The Blogosphere
57% of online US teens generate content, 40% read blogs, 20% have them! (Pew, Nov. 2005)
53% of companies are blogging (Guideware, Oct. 2005)
MySpace accounts for 1/3 of all web clicks (Hendler, 2006) ?!
But … the Blogosphere is awash in spam
Source: Wikipedia

4 Blogosphere/Splogosphere

5 Spam in the Blogosphere
Types: comment spam, ping spam, spam blogs
Akismet: “87% of all comments are spam”
75% of update pings are spam (ebiquity 2005)
20% of the blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005)
“Spam blogs, sometimes referred to by the neologism splogs, are weblog sites which the author uses only for promoting affiliated websites”
“Spings, or ping spam, are pings that are sent from spam blogs”
Definitions quoted from Wikipedia

6 Motivation: host ads

7 Motivation: index affiliates, promote PageRank

8 Spings from weblogs.com

9 Where do Splogs come from?
“Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…”
“Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!”
“Holy Grail Of Advertising...”
“Easily Dominate Any Market, Any Search Engine, Any Keyword.”
$197

11 Our splog bait was picked up and used by dozens of sploggers

13 Our feed is RSSjacked by at least one splogger

14 Why are splogs a problem?
Splogs undermine ranking algorithms
Splogs water down search results
Splogs threaten the Web advertising model
Splogs indulge in “plagiarism”
Splogs skew results of market research tools
Splogs stress the Blogosphere infrastructure of ping servers, blog search engines, etc.

15 Outline: Introduction, Motivation, BlogPulse Dataset, Weblogs.com Dataset, Implications

16 Splog Detection
SVM-based probabilistic splog detection (Kolari et al., 2006)
Hand-verified training set of blogs and splogs
Precision/recall of 87%
Bag-of-words feature built from the text on the blog home page, O(x)
Some additional local features
Classifier outputs: P(x is a splog | O(x)) and P(x is a blog | O(x))
Top features (blogs and splogs): we what was my org flickr paper 600 open words weblog motion me thank go january trackback archives now political find info news your 27 another website best articles on perfect products uncategorized 280 hot resources inc 60 three copyright
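
The bag-of-words feature O(x) named on this slide can be illustrated with a minimal sketch. The tokenizer, the binary (present or absent) encoding, the small vocabulary, and the two sample pages are all assumptions made for illustration; the slide does not say how the authors built their features, and the SVM itself is not shown here.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bag_of_words(text, vocabulary):
    """Return a binary vector: 1 if the vocabulary word occurs in the text, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]

# Hypothetical vocabulary, picked from the top-feature words listed on the slide.
VOCABULARY = ["trackback", "archives", "thank", "products", "resources", "copyright"]

# Hypothetical home-page text, invented for this example.
blog_page = "Thank you all for the trackback. See the archives for January."
splog_page = "Best products and hot resources. Copyright 2005 Example Inc."

blog_vector = bag_of_words(blog_page, VOCABULARY)
splog_vector = bag_of_words(splog_page, VOCABULARY)
print(blog_vector)
print(splog_vector)
```

A vector like this would be the input to a classifier that returns P(x is a splog | O(x)).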

17 This Work
By characterizing the splogosphere, we aim to: (i) get a handle on the seriousness of the problem, (ii) develop new techniques for splog detection, and (iii) recommend placement of splog filters on the blogging infrastructure.
Characterization is based on comparing authentic blogs against splogs to identify discriminating features.

18 Outline: Introduction, Motivation, BlogPulse Dataset, Weblogs.com Dataset, Implications

19 BlogPulse Dataset
21 days of July 2005
1.3 million blogs
Eliminated LiveJournal
Re-fetched blog home pages; many spam blogs no longer existed, since spam blogs are short-lived
Arrived at 500K samples
Set probability thresholds to 0.2 (authentic blog) and 0.8 (splog)
Identified 27K splogs
Sampled 27K authentic blogs
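
The two-threshold labelling step could look roughly like the sketch below. It assumes that both thresholds apply to the classifier's splog probability, that the bounds are inclusive, and that samples falling between the thresholds are left unlabelled; the slide states only the values 0.2 and 0.8.

```python
def label_by_probability(p_splog, blog_max=0.2, splog_min=0.8):
    """Label a sample from its splog probability, or return None if it is uncertain.

    Assumption: p_splog is P(x is a splog | O(x)); bounds are inclusive.
    """
    if not 0.0 <= p_splog <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_splog <= blog_max:
        return "blog"
    if p_splog >= splog_min:
        return "splog"
    return None  # between the thresholds: not confident enough to label

# Hypothetical classifier scores, invented for this example.
scores = [0.05, 0.2, 0.5, 0.8, 0.97]
labels = [label_by_probability(p) for p in scores]
print(labels)
```

Keeping only the confident ends of the score range trades sample size for label quality, which fits a characterization study that needs clean blog and splog sets.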

20 Splogs vs. Blogs – Word Count
[Figure panels: blogs; splogs; blogs and splogs]

21 Splogs vs. Blogs – In-degree
Top 5 blogs (URL, in-degree)
http://www.engadget.com 1942
http://www.huffingtonpost.com/theblog 905
http://www.crooksandliars.com 637
http://blogs.guardian.co.uk/news 616
http://www.littlegreenfootballs.com/weblog 611
Top 5 splogs (URL, in-degree)
http://spaces.msn.com/members/pony-girl 505
http://spaces.msn.com/members/black-puss 505
http://spaces.msn.com/members/amputee-women 505
http://spaces.msn.com/members/free-stories 505
http://spaces.msn.com/members/first-time-girl 505

22 Splogs vs. Blogs – Out-degree
Top 5 blogs (URL, out-degree)
http://www.xanga.com/home.aspx?user=hit_me_layoutz 273
http://www.xanga.com/home.aspx?user=i_jock_layouts 271
http://www.xanga.com/home.aspx?user=slp_layouts_slp 198
http://spaces.msn.com/members/cyrustse1986 193
http://www.xanga.com/home.aspx?user=layouts_n_codes2005 180
Top 5 splogs (URL, out-degree)
http://worldseriesofpokerchipscardguard.blogspot.com 898
http://rule-wsop.blogspot.com 898
http://worldseries-ofpoler.blogspot.com 898
http://qsopcom-1.blogspot.com 898
http://weopcom.blogspot.com 898
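
In-degree and out-degree rankings like those on the two slides above can be computed from a list of (source, target) links. The sketch below is only an illustration: the URLs are hypothetical, and counting each distinct link once is an assumption, since the slides do not say how repeated links were handled.

```python
from collections import Counter

def degree_counts(edges):
    """Return (in_degree, out_degree) Counters, counting each distinct link once."""
    in_degree = Counter()
    out_degree = Counter()
    for source, target in set(edges):  # assumption: duplicate links count once
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree

# Hypothetical link data, invented for this example.
edges = [
    ("http://a.example", "http://b.example"),
    ("http://c.example", "http://b.example"),
    ("http://a.example", "http://c.example"),
    ("http://a.example", "http://b.example"),  # repeated link
]

in_degree, out_degree = degree_counts(edges)
print(in_degree.most_common(5))
print(out_degree.most_common(5))
```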

23 Outline: Introduction, Motivation, BlogPulse Dataset, Weblogs.com Dataset, Implications

24 Weblogs.com Dataset
20 Nov 2005 – 11 Dec 2005
16 million update pings
Pings subdivided by language: da, de, en, es, fi, fr, it, nl, pt, sv
Heuristics to identify Japanese, Chinese, Korean
Set threshold of 0.5 to separate authentic blogs from splogs
Thanks to James Mayfield, JHU APL

25 Ping times – Italian Blogs

26 Sping vs. Ping times

27 Spings vs. Pings: frequency
The number of blogs plotted against their ping frequency follows a power law, but the number of splogs plotted against their sping frequency does not
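
One rough way to check for power-law behaviour is to histogram the per-blog ping counts and fit a straight line in log-log space. The sketch below assumes this least-squares diagnostic, which the slide does not name, and uses synthetic counts built to follow count = 1024 * k^-2 exactly. A straight log-log line is a rough indication, not a rigorous test.

```python
import math
from collections import Counter

def frequency_histogram(pings_per_blog):
    """Map ping frequency k to the number of blogs that pinged exactly k times."""
    return Counter(pings_per_blog)

def loglog_slope(histogram):
    """Least-squares slope of log(number of blogs) against log(ping frequency)."""
    points = [(math.log(k), math.log(n)) for k, n in histogram.items() if k > 0 and n > 0]
    if len(points) < 2:
        raise ValueError("need at least two distinct frequencies")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx

# Synthetic data (not from the slides): 1024 blogs ping once, 256 ping twice, and so on.
pings_per_blog = []
for k, n in [(1, 1024), (2, 256), (4, 64), (8, 16), (16, 4), (32, 1)]:
    pings_per_blog.extend([k] * n)

slope = loglog_slope(frequency_histogram(pings_per_blog))
print(round(slope, 6))
```

A distribution that does not follow a power law would show clear curvature or spikes around the fitted line, for example when many splogs ping at the same automated rate.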

28 All Pings – 16 Million
Close to 40% spings
Among English blogs
–75% of pings are spings
–Authentic blogs are 13% of all pings
Including Info domain
–50% of all pings are spings
url count
http://www.wiccapaganblog.com 1491
http://www.freecancerfacts.com/wp 1452
http://www.myaquariumiplace.com 1375
http://www.criss-angel.biz 1215
http://www.microdermabrasion-secrets.com 1211
http://www.tipstohealth.com/blog 1207
http://www.countrymusicdigest.com 1191
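
A URL and count table like the one on this slide can be produced by tallying the URLs in a ping log. The sketch below assumes the log has already been reduced to one URL per ping; the URLs shown are hypothetical.

```python
from collections import Counter

def top_pingers(ping_urls, n=7):
    """Return the n most frequently pinging URLs with their ping counts."""
    return Counter(ping_urls).most_common(n)

# Hypothetical ping log, invented for this example.
ping_urls = [
    "http://a.example", "http://b.example", "http://a.example",
    "http://c.example", "http://a.example", "http://b.example",
]

top_two = top_pingers(ping_urls, n=2)
for url, count in top_two:
    print(url, count)
```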

29 Outline: Introduction, Motivation, BlogPulse Dataset, Weblogs.com Dataset, Implications

30 Implications (1)
BlogPulse dataset
–Local word models most effective for fast splog detection
–If splogs escape filters, in-link and out-link distributions point to link-based classification
Weblogs.com dataset
–Ping frequency can be useful
–Splogs probably not a big problem in most European languages. Yet.
The nature of the domain points to spam filters employing a multi-step, adaptive approach, which we are currently pursuing

31 Implications (2) – Filter Design
[Diagram labels: Spam Blog Filter; Heuristics; Language Identifiers; Blog Identifier; Spam Blog Detectors; IP Blacklists; Supporting Info (optional); stages 1 to 4; outputs: Authentic Blogs, Spam Blogs]
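
A multi-step filter of the kind this slide depicts might be organized as a chain of stages, each of which either decides or passes the ping on. The sketch below is only one possible arrangement: the stage order, the ping field names (`ip`, `language`, `p_splog`), the blacklist entry, and the reuse of the 0.5 threshold are all assumptions, since the original diagram did not survive extraction.

```python
# Hypothetical configuration, invented for this example.
IP_BLACKLIST = {"192.0.2.10"}
SUPPORTED_LANGUAGES = {"da", "de", "en", "es", "fi", "fr", "it", "nl", "pt", "sv"}

def blacklist_stage(ping):
    """Cheap check first: reject pings from blacklisted IP addresses."""
    return "spam blog" if ping.get("ip") in IP_BLACKLIST else None

def language_stage(ping):
    """Stop early for languages the detector was not trained on."""
    return None if ping.get("language") in SUPPORTED_LANGUAGES else "unsupported language"

def detector_stage(ping, threshold=0.5):
    """Final stage: decide from a precomputed splog probability, if one is present."""
    p_splog = ping.get("p_splog")
    if p_splog is None:
        return None
    return "spam blog" if p_splog >= threshold else "authentic blog"

STAGES = [
    ("ip blacklist", blacklist_stage),
    ("language identifier", language_stage),
    ("splog detector", detector_stage),
]

def run_filter(ping, stages=STAGES):
    """Run the stages in order; the first stage that returns a verdict decides."""
    for name, stage in stages:
        verdict = stage(ping)
        if verdict is not None:
            return verdict, name
    return "undecided", None

print(run_filter({"ip": "198.51.100.7", "language": "en", "p_splog": 0.9}))
```

Putting the cheap checks first means the more expensive content-based detector runs only on pings that the earlier stages could not settle.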

32 Conclusions
Blog spam is a serious problem
–Classic arms race, e.g., increased plagiarism, feedjacking
Blog spam identification requires different tactics than those used for email and Web spam
–Local features effective, but not sufficient
–Lots of relational features (e.g., links, ads, IP addresses, tight but disconnected communities), but dynamism reduces effectiveness of analysis
Getting good training sets is expensive, especially in a multilingual environment
–A minute or more per judgment
Good opportunities for infrastructure insertion, e.g., sping-free ping servers

33 For more information
http://ebiquity.umbc.edu/
Annotated in OWL

34 Questions?

35 Blogs – A Specialized Domain
[Diagram labels: Update Pings; Ping Stream; Update Stream; Fetch Content; stages 1 to 4]

