SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari, Tim Finin, Anupam Joshi Computational Approaches to Analyzing Weblogs,


1 SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari, Tim Finin, Anupam Joshi Computational Approaches to Analyzing Weblogs, Stanford, March 27-29, 2006 http://ebiquity.umbc.edu

2 March 29, 2006 P. Kolari, T. Finin, A. Joshi :-: Blog Identification and Splog Detection
Blogosphere - the brighter side
Panel View
–Market Research
–PR Monitoring
From Presentations
–Opinion Extraction
–Demography-based analysis

3 Blogosphere - the darker side (1)
From the Panel
–Blogger is cracking down on splogs
–SixApart and TypePad
–Content Hijacking
From Presentations
–Removing spam is an essential part of a blog search engine
–Cost of cleaning up splogs and its effect on results

4 Blogosphere - the darker side (2)

5 The Blogosphere
[Diagram labels: Blogger, msn-spaces, livejournal; Information; Audience; BLOG HOSTS; PING SERVERS; SPINGS; SPLOGS]

6 Spings – weblogs.com

7 Spings – weblogs.com (2)

8 Spings – weblogs.com (3)

9 Splogs – icerocket.com

10 Splogs – icerocket.com (2)

11 A Featured Splog?

12 Splogs – technorati.com (2)

13 Splogs – The Source!
“Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…”
“Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!”
“Holy Grail Of Advertising...”
“Easily Dominate Any Market, Any Search Engine, Any Keyword.”

14 Spam we target -- summarized
Non-blogs
–For increased search engine exposure
–Through BLOG IDENTIFICATION
Splogs
–Adsense clicks for high-paying contexts (i)
–Unjustifiably increase page-rank (importance) of affiliates – link farms (ii)
–Combination of (i) and (ii)
–Through SPLOG DETECTION

15 This work
Can machine learning models be effective at countering splogs in the blogosphere?
How do they perform when using only features local to a blog?

16 Dataset for Training
Technorati random sampling of 500K blogs – May/June 2005
Dropped those from top blogging hosts
–Blog Identification is an easy task using just URL patterns/domains
Sampled the rest in different ways to create training datasets
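
The "dropped those from top blogging hosts" step could look roughly like the sketch below. The host list and both function names are assumptions made for illustration; the slides do not say which domains or URL patterns were actually used.

```python
from urllib.parse import urlparse

# Hypothetical list of well-known blog hosting domains (an assumption;
# the slides do not give the actual list that was used).
KNOWN_BLOG_HOSTS = ("blogspot.com", "livejournal.com", "spaces.msn.com", "typepad.com")


def is_on_known_blog_host(url: str) -> bool:
    """Return True if the URL's host is one of the known hosts or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in KNOWN_BLOG_HOSTS)


def drop_known_hosts(urls):
    """Keep only URLs that cannot be identified as blogs from the domain alone."""
    return [u for u in urls if not is_on_known_blog_host(u)]
```

URLs that survive this filter are the ones for which a learned blog identification model is actually needed.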

17 Blog-HomePage/Non-Blog
Sampled for blog home-pages
Sampled for external links from these blogs to capture contextually similar pages – but from non-blogs
All samples were manually verified
Training set consists of 2100 positive and 2100 negative samples – multiple languages
Let's call this (BH, NB)

18 Blog-SubPage/Non-Blog
Sampled for local-links from BH
Sampled for out-links similar to NB
No manual verification
2600 positive and 2600 negative samples
Let's call this (BNH, NB)

19 Authentic Blog/Splog
Manually identified 700 splogs (English) in the BH sample
Sampled for 700 blogs from the rest
700 positive and 700 negative samples
Let's call this (AB, S)

20 Comparison Baselines
Blog Identification
Feature            Precision  Recall  F1
meta               1          .75     .85
RSS/Atom           .96        .90     .93
Text - blog        .88        .79     .83
Text – comment     .83        .87     .85
Text – trackback   .99        .18     .30
Text – 2005        .56        .97     .71
Splog Detection is a known problem!
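
As a reading aid for the baseline table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall values on the slide; the function and dictionary names are illustrative, and the recomputed values agree with the slide's F1 column only up to rounding of the published figures.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the baseline table above.
baselines = {
    "meta": (1.0, 0.75),
    "RSS/Atom": (0.96, 0.90),
    "Text - blog": (0.88, 0.79),
    "Text - comment": (0.83, 0.87),
    "Text - trackback": (0.99, 0.18),
    "Text - 2005": (0.56, 0.97),
}

for name, (p, r) in baselines.items():
    print(f"{name:18s} F1 = {f1_score(p, r):.3f}")
```

The trackback row shows why F1 is reported: a precision of .99 with a recall of .18 still yields a low combined score.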

21 Evaluation - Background
SVMs as implemented by libsvm
Leave-One-Out cross-validation
No stop word elimination
No stemming
Mutual Information for feature selection
–Frequency count provided similar results
Binary feature encoding
–Other encodings give similar results
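
A minimal sketch of the preprocessing named on this slide, assuming each page is represented as a set of tokens and labels are 0/1. It scores tokens by mutual information with the class label, keeps the top k, encodes pages as binary vectors, and enumerates leave-one-out splits. All names are illustrative, and the SVM training step (libsvm in the original work) is deliberately left out.

```python
import math
from collections import Counter


def mutual_information(docs, labels, token):
    """Mutual information (in bits) between token presence and the class label."""
    n = len(docs)
    joint = Counter((token in d, y) for d, y in zip(docs, labels))
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = sum(c for (xx, _), c in joint.items() if xx == x) / n
        p_y = sum(c for (_, yy), c in joint.items() if yy == y) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


def select_features(docs, labels, k):
    """Return the k tokens with the highest mutual information (ties broken by name)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-mutual_information(docs, labels, t), t))
    return ranked[:k]


def binary_encode(doc, features):
    """Binary feature encoding: 1 if the token occurs on the page, else 0."""
    return [1 if f in doc else 0 for f in features]


def leave_one_out(n):
    """Yield (training indices, held-out index) for each of the n samples."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i
```

Each held-out index would be classified by a model trained on the remaining indices, and precision, recall, and F1 are then computed over all held-out predictions.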

22 New features for blogs
Hyper-links on a page
–Tokenized by “/” and “-”
Anchor-text on a page
Meta tags
–From HTML HEAD element
4-grams
–Contiguous blocks of 4 characters
Combinations
–words and urls
–meta and link
–urls, anchors, meta
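
The URL and 4-gram features above can be sketched as follows. The function names, the choice to keep whitespace inside 4-grams, and the prefixing scheme used to combine feature sets are all assumptions; the slides only state the separators and the block length.

```python
import re


def url_tokens(url: str):
    """Split a hyperlink on "/" and "-" and drop empty pieces."""
    return [t for t in re.split(r"[/-]", url) if t]


def char_ngrams(text: str, n: int = 4):
    """All contiguous blocks of n characters (assumption: whitespace is kept)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def combine(**feature_sets):
    """Merge named feature sets, prefixing tokens so that the same string
    seen as a word and as a URL token stays distinct (an assumed scheme)."""
    return {f"{name}:{tok}" for name, toks in feature_sets.items() for tok in toks}
```

For example, `combine(w=["blog"], u=url_tokens("http://example.com/my-first-post/"))` gives one token set corresponding to the "words and urls" combination.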

23 Blog Identification – (BH, NB)
Feature      Precision  Recall  F1    Feature Size
Words (w)    .976       .941    .958  19000
Urls (u)     .982       .962    .972  7000
Anchors (a)  .975       .926    .950  8000
Meta (m)     .981       .774    .865  3500
w+u          .985       .966    .975  26000
m+LINK       .973       .939    .956  4000
u+a          .985       .961    .973  15000
u+a+m        .986       .964    .975  18500
4grams       .982       .964    .973  25000

24 Blog Identification – (BNH, NB)
Feature      Precision  Recall  F1    Feature Size
Words (w)    .976       .930    .952  19000
Urls (u)     .966       .904    .934  7000
Anchors (a)  .962       .897    .923  8000
Meta (m)     .981       .919    .945  3500
w+u          .979       .932    .955  26000
m+LINK       .919       .942    .930  4000
u+a          .977       .919    .947  15000
u+a+m        .989       .940    .964  18500
4grams       .976       .930    .952  25000

25 Splog Detection - (AB, S)
Feature      Precision  Recall  F1    Feature Size
Words (w)    .887       .864    .875  19000
Urls (u)     .804       .827    .815  7000
Anchors (a)  .854       .807    .830  8000
Meta (m)     .741       .747    .744  3500
w+u          .893       .869    .881  26000
m+LINK       .736       .755    .745  4000
u+a          .858       .833    .845  15000
u+a+m        .866       .841    .853  18500
4grams       .867       .844    .855  25000

26 A Quick Analysis
Ping Servers
–Our analysis in December 2005
–At least 75% of pings are spings
Technorati Index
–Data from week of March 20, 2006
–Random queries to sample for 10K blogs
–3K blogspot, 2.5K livejournal, 1.8K msn
–We predict that 1.5K from blogspot and 250 from LJ are splogs
–Overall 2.5K/10K are splogs ~ 25% of the fresh index!
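
The per-host rates implied by these counts can be checked with a few lines of arithmetic. The counts are the rounded figures from the slide; the variable and function names are illustrative.

```python
# Rounded counts from the slide (sample of 10K blogs from the Technorati index).
sampled = {"blogspot": 3000, "livejournal": 2500, "msn": 1800}
predicted_splogs = {"blogspot": 1500, "livejournal": 250}
total_sampled = 10_000
total_predicted_splogs = 2500


def splog_rate(splogs: int, blogs: int) -> float:
    """Fraction of sampled blogs that are predicted to be splogs."""
    return splogs / blogs


for host, count in predicted_splogs.items():
    print(f"{host}: {splog_rate(count, sampled[host]):.0%}")
print(f"overall: {splog_rate(total_predicted_splogs, total_sampled):.0%}")

# Predicted splogs not attributed to blogspot or livejournal on the slide.
unattributed = total_predicted_splogs - sum(predicted_splogs.values())
print(f"unattributed predicted splogs: {unattributed}")
```

This gives 50% for blogspot, 10% for livejournal, and 25% overall, with 750 of the predicted splogs not broken down by host on the slide.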

27 Blogosphere Spam - Summary
[Diagram labels: Blogger, msn-spaces, livejournal; Information; Audience; BLOG HOSTS; PING SERVERS; percentages shown: 75%, 25%, 50%, 10%]

28 And it's not getting easier …
But spammers still leave trails that can be exploited

29 Conclusion
Blogosphere is prone to spam at various infrastructure points
Local content-based models can be quite effective by themselves
75% of pings and, further downstream, 25% of fresh content are spam
Blogger’s problem is now livejournal’s problem, and now everyone’s problem
Combining local and global splog models is our current direction

30 Questions?
Google “Splog Detection”
memeta
–http://memeta.umbc.edu
eBiquity
–http://ebiquity.umbc.edu
–http://ebiquity.umbc.edu/blogger
Check out Umbria’s report on splogs
–http://www.umbrialistens.com/files/uploads/umbria_splog.pdf

