SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari, Tim Finin, Anupam Joshi Computational Approaches to Analyzing Weblogs,


SVMs for the Blogosphere: Blog Identification and Splog Detection
Pranam Kolari, Tim Finin, Anupam Joshi
Computational Approaches to Analyzing Weblogs, Stanford, March 27-29, 2006

March 29, 2006 P. Kolari, T. Finin, A. Joshi :-: Blog Identification and Splog Detection

Blogosphere - the brighter side
Panel View
– Market Research
– PR Monitoring
From Presentations
– Opinion Extraction
– Demography based analysis

Blogosphere - the darker side (1)
From the Panel
– Blogger is cracking down on splogs
– SixApart and TypePad
– Content Hijacking
From Presentations
– Removing spam is an essential part of a blog search engine
– Cost of cleaning up splogs and its effect on results

Blogosphere - the darker side (2)

The Blogosphere
[Diagram labels: Blogger, msn-spaces, livejournal; Information; Audience; BLOG HOSTS; PING SERVERS; SPINGS; SPLOGS]

Spings – weblogs.com

Spings – weblogs.com (2)

Spings – weblogs.com (3)

Splogs – icerocket.com

Splogs – icerocket.com (2)

A Featured Splog?

Splogs – technorati.com (2)

Splogs – The Source!
“Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…”
“Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!”
“Holy Grail Of Advertising...”
“Easily Dominate Any Market, Any Search Engine, Any Keyword.”

Spam we target -- summarized
Non-blogs
– For increased search engine exposure
– Through BLOG IDENTIFICATION
Splogs
– Adsense clicks for high-paying contexts (i)
– Unjustifiably increase page-rank (importance) of affiliates – link farms (ii)
– Combination of (i) and (ii)
– Through SPLOG DETECTION

This work
Can machine learning models be effective at countering splogs on the blogosphere?
How do they perform when using features local to a blog?

Dataset for Training
Technorati random sampling: 500K blogs, May/June 2005
Dropped those from top blogging hosts
– Blog identification for these hosts is an easy task using just URL patterns/domains
Sampled the rest in different ways to create training datasets
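
The host-based filtering step can be sketched as follows. This is only an illustration of identifying blogs by URL domain; the host list and the function names are assumptions made for the example, not the list or code used in the talk.

```python
from urllib.parse import urlparse

# Hypothetical list of well-known blog hosting domains (an assumption,
# not the list used by the authors).
KNOWN_BLOG_HOSTS = ("blogspot.com", "livejournal.com", "typepad.com", "spaces.msn.com")


def is_on_known_blog_host(url: str) -> bool:
    """Return True if the URL's host is a known blog host or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in KNOWN_BLOG_HOSTS)


def drop_known_hosts(urls):
    """Keep only URLs that cannot be labelled as blogs from the domain alone."""
    return [u for u in urls if not is_on_known_blog_host(u)]


sample_urls = [
    "http://example.blogspot.com/2005/05/post.html",
    "http://www.example.org/weblog/",
]
remaining = drop_known_hosts(sample_urls)
```

Pages left in `remaining` are the ones for which a learned classifier is needed, since their domain says nothing about whether they are blogs.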

Blog-HomePage/Non-Blog
Sampled for blog home-pages
Sampled for external links from these blogs to capture contextually similar pages – but from non-blogs
All samples were manually verified
Training set consists of 2100 positive and 2100 negative samples – multiple languages
Let's call this (BH, NB)

Blog-SubPage/Non-Blog
Sampled for local-links from BH
Sampled for out-links similar to NB
No manual verification
2600 positive and 2600 negative samples
Let's call this (BNH, NB)

Authentic Blog/Splog
Manually identified 700 splogs (English) in the BH sample
Sampled for 700 blogs from the rest
700 positive and 700 negative samples
Let's call this (AB, S)

Comparison Baselines
Feature            Precision  Recall  F1
meta
RSS/Atom
Text - blog
Text – comment
Text – trackback
Text –
Blog Identification
Splog Detection is a known problem!
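
For reference, the three columns of the baseline table can be computed as in the sketch below. The feed-link heuristic is only one plausible reading of the "RSS/Atom" row, and the toy pages and labels are invented for the example; none of this reproduces the authors' baselines or numbers.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for boolean labels (True = blog)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def has_feed_link(html: str) -> bool:
    """Hypothetical heuristic: predict 'blog' if the page advertises an RSS/Atom feed."""
    lowered = html.lower()
    return "application/rss+xml" in lowered or "application/atom+xml" in lowered


# Toy pages with made-up labels (True = blog).
pages = [
    ("<link rel='alternate' type='application/rss+xml' href='/feed'>", True),
    ("<p>news site</p>", False),
    ("<link rel='alternate' type='application/atom+xml' href='/atom'>", False),
    ("<p>a blog without a feed</p>", True),
]
labels = [label for _, label in pages]
predictions = [has_feed_link(html) for html, _ in pages]
precision, recall, f1 = precision_recall_f1(labels, predictions)
```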

Evaluation - Background
SVMs as implemented by libsvm
Leave-One-Out cross-validation
No stop word elimination
No stemming
Mutual Information for feature selection
– Frequency count provided similar results
Binary feature encoding
– Other encodings give similar results
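
A minimal sketch of two of the steps listed above, binary feature encoding and mutual-information feature selection, is given here in plain Python. It does not train an SVM or call libsvm; the toy documents, labels, and function names are assumptions for illustration only.

```python
import math
from collections import Counter


def binary_encode(doc_tokens, vocabulary):
    """Encode a document as 0/1 flags: 1 if the term occurs at all, else 0."""
    present = set(doc_tokens)
    return [1 if term in present else 0 for term in vocabulary]


def mutual_information(feature_values, labels):
    """Mutual information (in bits) between a discrete feature and the class label."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    fx = Counter(feature_values)
    fy = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((fx[x] / n) * (fy[y] / n)))
    return mi


def select_top_k(docs, labels, k):
    """Rank vocabulary terms by mutual information with the label and keep the top k."""
    vocabulary = sorted({t for d in docs for t in d})
    rows = [binary_encode(d, vocabulary) for d in docs]
    scored = []
    for j, term in enumerate(vocabulary):
        column = [row[j] for row in rows]
        scored.append((mutual_information(column, labels), term))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [term for _, term in scored[:k]]


# Toy tokenized documents with made-up labels (1 = splog, 0 = authentic blog).
docs = [["cheap", "pills", "blog"], ["cheap", "loans"], ["my", "blog", "trip"], ["blog", "recipe"]]
labels = [1, 1, 0, 0]
top_terms = select_top_k(docs, labels, 1)
```

The selected terms would then define the columns of the binary vectors handed to the classifier.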

New features for blogs
Hyper-links on a page
– Tokenized by “/” and “-”
Anchor-text on a page
Meta tags
– From HTML HEAD element
4-grams
– Contiguous blocks of 4 characters
Combinations
– words and urls
– meta and link
– urls, anchors, meta
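
The feature types above could be extracted roughly as follows. This is a sketch using only the Python standard library; the class and function names are hypothetical, and for brevity the extractor collects meta tags anywhere in the page rather than checking that they sit inside the HEAD element.

```python
import re
from html.parser import HTMLParser


def url_tokens(url):
    """Split a hyperlink on "/" and "-" and drop empty pieces."""
    return [t for t in re.split(r"[/-]", url) if t]


def char_ngrams(text, n=4):
    """Return all contiguous blocks of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class LinkAndMetaExtractor(HTMLParser):
    """Collect hyperlink targets, anchor text, and meta tag content from a page."""

    def __init__(self):
        super().__init__()
        self.urls, self.anchors, self.meta = [], [], []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.urls.append(attrs["href"])
            self._in_anchor = True
        elif tag == "meta" and attrs.get("content"):
            self.meta.append(attrs["content"])

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor and data.strip():
            self.anchors.append(data.strip())


page = (
    '<html><head><meta name="generator" content="ExampleBlogTool"></head>'
    '<body><a href="http://example.com/cheap-loans/">Cheap loans</a></body></html>'
)
extractor = LinkAndMetaExtractor()
extractor.feed(page)
link_features = [tok for url in extractor.urls for tok in url_tokens(url)]
```

Combinations such as "urls, anchors, meta" then amount to concatenating the corresponding token lists before encoding.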

Blog Identification – (BH, NB)
Feature      Precision  Recall  F1  Feature Size
Words (w)
Urls (u)
Anchors (a)
Meta (m)
w+u
m+LINK
u+a
u+a+m
4-grams

Blog Identification – (BNH, NB)
Feature      Precision  Recall  F1  Feature Size
Words (w)
Urls (u)
Anchors (a)
Meta (m)
w+u
m+LINK
u+a
u+a+m
4-grams

Splog Detection - (AB, S)
Feature      Precision  Recall  F1  Feature Size
Words (w)
Urls (u)
Anchors (a)
Meta (m)
w+u
m+LINK
u+a
u+a+m
4-grams

A Quick Analysis
Ping Servers
– Our analysis in December 2005
– At least 75% of pings are spings
Technorati Index
– Data from week of March 20, 2006
– Random queries to sample for 10K blogs
– 3K blogspot, 2.5K livejournal, 1.8K msn
– We predict that 1.5K from blogspot and 250 from LJ are splogs
– Overall 2.5K/10K are splogs ~ 25% of the fresh index!
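
The percentages implied by these counts can be checked with a few lines of arithmetic. The counts are the rounded figures from the slide; the variable and function names are just for illustration.

```python
# Rounded counts read off the slide (K = thousand).
sampled = {"blogspot": 3000, "livejournal": 2500, "msn": 1800}
predicted_splogs = {"blogspot": 1500, "livejournal": 250}
total_sampled = 10_000
total_predicted_splogs = 2500


def splog_rate(n_splogs, n_sampled):
    """Fraction of sampled blogs predicted to be splogs."""
    return n_splogs / n_sampled


per_host = {h: splog_rate(predicted_splogs[h], sampled[h]) for h in predicted_splogs}
overall = splog_rate(total_predicted_splogs, total_sampled)

for host, rate in per_host.items():
    print(f"{host}: {rate:.0%}")
print(f"overall: {overall:.0%}")
```

This gives roughly 50% for blogspot, 10% for livejournal, and 25% overall.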

Blogosphere Spam - Summary
[Diagram labels: Blogger, msn-spaces, livejournal; Information; Audience; BLOG HOSTS; PING SERVERS; 75%, 25%, 50%, 10%]

And it's not getting easier …
But spammers still leave trails that can be exploited

Conclusion
Blogosphere is prone to spam at various infrastructure points
Local content based models can be quite effective by themselves
75% of pings and, further downstream, 25% of fresh content is spam
Blogger’s problem is now livejournal’s problem, and now everyone’s problem
Combining local and global splog models is our current direction

Questions?
Google “Splog Detection”
memeta –
eBiquity –
– Check out Umbria’s report on splogs – /umbria_splog.pdf