UMBC an Honors University in Maryland Characterizing the Splogosphere Tim Finin Pranam Kolari, Akshay Java.

Presentation transcript:

UMBC, an Honors University in Maryland
Characterizing the Splogosphere
Pranam Kolari, Akshay Java and Tim Finin
University of Maryland, Baltimore County
3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics
22 May 2006

Outline
Introduction
Motivation
BlogPulse Dataset
Weblogs.com Dataset
Implications

The Blogosphere
57% of online US teens generate content, 40% read blogs, 20% have them! (Pew, Nov. 2005)
53% of companies are blogging (Guideware, Oct. 2005)
MySpace accounts for 1/3 of all web clicks (Hendler, 2006) ?!
But … the Blogosphere is awash in spam
Source: Wikipedia

Blogosphere/Splogosphere

Spam in the Blogosphere
Types: comment spam, ping spam, spam blogs
Akismet: “87% of all comments are spam”
75% of update pings are spam (ebiquity 2005)
20% of the blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005)
“Spam blogs, sometimes referred to by the neologism splogs, are weblog sites which the author uses only for promoting affiliated websites”
“Spings, or ping spam, are pings that are sent from spam blogs”
Source of definitions: Wikipedia

Motivation: host ads

Motivation: index affiliates, promote PageRank

Spings from weblogs.com

Where do Splogs come from?
“Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…”
“Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!”
“Holy Grail Of Advertising...”
“Easily Dominate Any Market, Any Search Engine, Any Keyword.”
$197


Our splog bait was picked up and used by dozens of sploggers


Our feed is RSSjacked by at least one splogger

Why are splogs a problem?
Splogs undermine ranking algorithms
Splogs water down search results
Splogs threaten the Web advertising model
Splogs indulge in “plagiarism”
Splogs skew results of market research tools
Splogs stress the Blogosphere infrastructure of ping servers, blog search engines, etc.


Splog Detection
SVM-based probabilistic splog detection (Kolari et al., 2006)
Hand-verified training set of blogs and splogs
Precision/Recall of 87%
Bag-of-words based features using text on the blog home page, O(x)
Some additional local features
Classifier outputs: P( x is a splog | O(x) ) and P( x is a blog | O(x) )
Top features for blogs and splogs: we, what, was, my, org, flickr, paper, 600, open, words, weblog, motion, me, thank, go, january, trackback, archives, now, political, find, info, news, your, 27, another, website, best, articles, on, perfect, products, uncategorized, 280, hot, resources, inc, 60, three, copyright
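The bag-of-words representation O(x) mentioned on this slide can be sketched as follows. This is a minimal illustration of the feature extraction step only, not the authors' implementation: the tokenizer, the use of binary (present/absent) features, and the function names `tokenize`, `build_vocabulary` and `bag_of_words` are all assumptions. The resulting vectors would then be handed to an SVM, which is not shown here.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token seen in the documents to a column index."""
    vocab = {}
    for doc in documents:
        for token in tokenize(doc):
            # Assign the next free index the first time a token is seen.
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(text, vocab):
    """Return a binary vector: 1 where a vocabulary word occurs in the text.

    Binary features are an assumption; counts or TF-IDF weights are
    equally plausible choices for O(x).
    """
    present = set(tokenize(text))
    # dicts preserve insertion order, so positions match the stored indices.
    return [1 if word in present else 0 for word in vocab]


# Hypothetical home-page snippets, used only to build a toy vocabulary.
pages = ["My weblog archives", "Find best products"]
vocab = build_vocabulary(pages)
features = bag_of_words("best weblog", vocab)
```

Words outside the vocabulary are simply ignored, which is the usual behaviour for a fixed feature space.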

This Work
By characterizing the splogosphere, we aim to achieve the following:
(i) Get a handle on the seriousness of the problem,
(ii) Develop new techniques for splog detection, and
(iii) Recommend placement of splog filters on the blogging infrastructure.
Characterization is based on comparing the nature of authentic blogs against splogs to identify discriminating features


BlogPulse Dataset
21 days of July million blogs
Eliminated LiveJournal
Re-fetched blog home pages; many spam blogs no longer existed, since spam blogs are short-lived
Arrived at 500K samples
Set probability thresholds to 0.2 (authentic blog) and 0.8 (splog)
Identified 27K splogs
Sampled for 27K authentic blogs
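The two-threshold labelling described on this slide can be sketched as below. The slide gives only the threshold values 0.2 and 0.8; treating the score as P(splog), making the comparisons inclusive, and calling the middle band "uncertain" are assumptions, as is the function name `label_by_probability`.

```python
def label_by_probability(p_splog, blog_max=0.2, splog_min=0.8):
    """Label a sample from its estimated probability of being a splog.

    Samples at or below blog_max are kept as authentic blogs, samples at
    or above splog_min are kept as splogs, and everything in between is
    left unlabelled so that only confident predictions are used.
    """
    if not 0.0 <= p_splog <= 1.0:
        raise ValueError("p_splog must be a probability in [0, 1]")
    if p_splog <= blog_max:
        return "blog"
    if p_splog >= splog_min:
        return "splog"
    return "uncertain"


# Hypothetical classifier scores for five blog home pages.
scores = [0.05, 0.2, 0.5, 0.8, 0.97]
labels = [label_by_probability(p) for p in scores]
```

Keeping a wide gap between the two thresholds trades sample size for label quality: fewer samples survive, but those that do are less likely to be mislabelled.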

Splogs vs. Blogs – Word Count
(Figure panels: blogs; splogs; blogs and splogs)

Splogs vs. Blogs – In-degree
(Figure label: Top 5)

Splogs vs. Blogs – Out-degree
(Figure label: Top 5)


Weblogs.com Dataset
20 Nov 2005 – 11 Dec million update pings
Pings subdivided by language: da, de, en, es, fi, fr, it, nl, pt, sv
Heuristics to identify Japanese, Chinese, Korean
Set threshold of 0.5 to separate authentic blogs from splogs
Thanks to James Mayfield, JHU APL

Ping times – Italian Blogs

Sping vs. Ping times

Spings vs. Pings: frequency
The number of authentic blogs plotted against their ping frequency follows a power law, but the number of splogs plotted against their sping frequency does not
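One rough way to check a claim of this kind is to fit a straight line to the log-log histogram of pings per blog: under a power law the points fall close to a line whose slope is the negative exponent. The sketch below is illustrative only and is not taken from the slides; the function names and the toy data are assumptions, and a least-squares fit on log-log axes is a quick diagnostic rather than a rigorous statistical test.

```python
import math
from collections import Counter


def frequency_histogram(pings_per_blog):
    """Count how many blogs sent exactly k pings, for each observed k."""
    return Counter(pings_per_blog)


def log_log_fit(histogram):
    """Least-squares line through (log k, log count).

    Returns (slope, r_squared). A slope near -alpha with r_squared near 1
    is consistent with count proportional to k ** -alpha.
    """
    if len(histogram) < 2:
        raise ValueError("need at least two distinct ping frequencies")
    xs = [math.log(k) for k in histogram]
    ys = [math.log(c) for c in histogram.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # If every count is identical there is no variance left to explain.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, r_squared


# Toy data: the number of blogs with k pings is exactly 1000 / k**2.
pings = [1] * 1000 + [2] * 250 + [5] * 40 + [10] * 10
slope, r_squared = log_log_fit(frequency_histogram(pings))
```

A low `r_squared` on the splog histogram, next to a high one on the authentic-blog histogram, would be in line with what the slide reports.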

All Pings – 16 Million
Close to 40% spings
Among English blogs
–75% of pings are spings
–Authentic blogs are 13% of all pings
Including Info domain
–50% of all pings are spings
(Table columns: url, count; entry: secrets.com)


Implications (1)
BlogPulse dataset
–Local word models most effective for fast splog detection
–If splogs escape filters, in-link and out-link distributions point to link-based classification
Weblogs.com dataset
–Ping frequency can be useful
–Splogs probably not a big problem in most European languages. Yet.
The nature of the domain points to spam filters employing a multi-step, adaptive approach, which we are currently pursuing

Implications (2) – Filter Design
(Diagram: a Spam Blog Filter with stages numbered 1 to 4. Components: Heuristics, Language Identifiers, Blog Identifier, IP Blacklists, Spam Blog Detectors. Optional input: Supporting Info. Outputs: Authentic Blogs, Spam Blogs.)
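A multi-stage filter of the kind this slide depicts can be sketched as a chain of cheap checks, each of which either settles the verdict or passes the ping on. The slide's diagram does not survive well enough to recover the exact stage order, so the ordering, the ping fields (`lang`, `ip`, `is_blog`, `p_splog`), the verdict strings and the blacklist entry below are all hypothetical.

```python
# Hypothetical blacklist; the address comes from a documentation-only range.
BLACKLISTED_IPS = {"203.0.113.7"}


def language_stage(ping):
    """Skip pings in languages the later stages are not trained for."""
    return None if ping.get("lang") == "en" else "skipped"


def blacklist_stage(ping):
    """Flag pings that originate from a known bad IP address."""
    return "spam" if ping.get("ip") in BLACKLISTED_IPS else None


def blog_identifier_stage(ping):
    """Skip pages that are not blogs at all."""
    return None if ping.get("is_blog", False) else "skipped"


def detector_stage(ping, threshold=0.5):
    """Flag pings whose estimated splog probability reaches the threshold."""
    return "spam" if ping.get("p_splog", 0.0) >= threshold else None


STAGES = [language_stage, blacklist_stage, blog_identifier_stage, detector_stage]


def classify(ping, stages=STAGES):
    """Run the stages in order and return the first verdict reached.

    A ping that passes every stage without a verdict is treated as an
    authentic blog.
    """
    for stage in stages:
        verdict = stage(ping)
        if verdict is not None:
            return verdict
    return "authentic"


example = {"lang": "en", "ip": "198.51.100.4", "is_blog": True, "p_splog": 0.1}
result = classify(example)
```

Putting the inexpensive checks first means the costlier content-based detector only runs on pings that survive them, which matters at the ping volumes reported earlier in the talk.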

Conclusions
Blog spam is a serious problem
–Classic arms race, e.g., increased plagiarism, feedjacking
Blog spam identification requires different tactics than those used for e-mail and Web spam
–Local features effective, but not sufficient
–Lots of relational features (e.g., links, ads, IP addresses, tight but disconnected communities), but dynamism reduces the effectiveness of analysis
Getting good training sets is expensive, especially in a multilingual environment
–A minute or more per judgment
Good opportunities for infrastructure insertion, e.g., sping-free ping servers

Annotated in OWL
For more information

Questions?

Blogs – A Specialized Domain
(Diagram labels: Update Pings, Ping Stream, Update Stream, Fetch Content; steps numbered 1 and 2)