Blog Track Open Task: Spam Blog Detection Tim Finin Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin.


Blog Track Open Task: Spam Blog Detection
Tim Finin
Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin Martineau (University of Maryland, Baltimore County)
James Mayfield (Johns Hopkins University Applied Physics Laboratory)
NIST Blog Pre-Track, 14 Nov 2006

Blogosphere Reputation at Stake!

Spam in the Blogosphere
Types: comment spam, ping spam, spam blogs
Akismet: “87% of all comments are spam”
75% of update pings are spam (ebiquity 2005)
20% of the blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005)
“Spam blogs, sometimes referred to by the neologism splogs, are weblog sites which the author uses only for promoting affiliated websites” [1]
“Spings, or ping spam, are pings that are sent from spam blogs” [1]
[1] Wikipedia

Auto-generated and/or Plagiarized Content
Advertisements in Profitable Contexts
Link Farms to promote affiliates

Why a problem?
The blogosphere is an increasingly important segment of the Web; ~12 hours from post to Google index
Splog content provides no additional value
Splog content is often plagiarized
Splogs demote the value of authentic content
Splogs steal advertising (referral) revenue from authentic content producers
Splogs stress the blogosphere infrastructure
Splogs can skew blog analytics, as was observed in the TREC Blog Track 2006

Nature of Splogs in TREC 2006
Around 83K identifiable blog home-pages in the collection, with 3.2M permalinks
81K blogs could be processed
We use splog detection models developed on blog home-pages; 87% accuracy
We identified 13,542 splogs
Blacklisted 543K permalinks from these splogs
~16% of the entire collection
~17% splog posts injected into TREC dataset [1]
[1] The TREC Blog06 Collection: Creating and Analyzing a Blog Test Collection, C. Macdonald, I. Ounis
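As a quick sanity check on the percentages above, the ratios can be recomputed from the rounded counts quoted on the slide. This is only a sketch: the variable names and the `share` helper are illustrative, the inputs are the slide's rounded figures rather than exact collection counts, and the slide does not say which ratio its "~16%" refers to, so several candidates are printed.

```python
# Rounded counts as quoted on the slide; exact collection figures may differ.
blogs_identified = 83_000        # "around 83K identifiable blog home-pages"
blogs_processed = 81_000         # "81K blogs could be processed"
permalinks_total = 3_200_000     # "3.2M permalinks"
splogs_found = 13_542            # "we identified 13,542 splogs"
permalinks_blacklisted = 543_000 # "blacklisted 543K permalinks"


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


print(f"splogs / processed blogs:   {share(splogs_found, blogs_processed):.1f}%")
print(f"splogs / identified blogs:  {share(splogs_found, blogs_identified):.1f}%")
print(f"blacklisted / all permalinks: {share(permalinks_blacklisted, permalinks_total):.1f}%")
```

With these rounded inputs, all three ratios fall between roughly 16% and 17%, which is in line with both figures on the slide.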

Impact of Splogs in TREC Queries
Queries shown: American Idol, Cholesterol, Hybrid Cars

Higher in Spam Prone Contexts
Spam query terms based on analysis by Macdonald et al.
Terms shown: Card, Interest, Mortgage

Splog Detection Task Proposal
Motivation
– Detecting and eliminating spam is an essential requirement for any blog analysis
– Splog detection has characteristics that set it apart from e-mail and web spam detection
Constraint
– Simulate how blog search systems operate
Task Statement
– Is an input permalink (post) spam?

Relation to E-mail Spam Detection
TREC has an e-mail spam classification task
Similar in
– Fast online spam detection
Different in
– Nature of spamming: links, RSS feeds, web graph, metadata
– Users are targeted indirectly through search engines, e.g. “N1ST” is not relevant for a “NIST” query

Relation to Web Spam Detection
TREC does not have a web spam track
Similar in
– Spamming web link structure
Different in
– Coverage: blog analytics engines don’t look beyond the blogosphere
– Speed of detection is important, 150K posts/hour
– Presence of structured text through RSS feeds presents new opportunities, and challenges

Difficulty
We have been experimenting with multiple approaches since mid-2005
Data:

Difficulty
Evolving spamming techniques and splog creation genres
Most basic spam techniques
– Generate content by stuffing key dictionary words
– Generate links to affiliates, through link dumps on blogrolls, linkrolls or after post content
Evolving spam techniques
– Scrape contextually similar content to generate posts
– RSS hijacking
– Aggregation software, e.g. Planet X
– Intersperse links randomly
– Make link placement meaningful
– Add spam comments and then ping. Repeat.

Task Details - Dataset Creation
Similar to TREC Blog 2006, a collection of feeds, blog home-pages and permalinks
View dataset D as two sets: D_base, D_test
D_base to span (n-x) days, and D_test to span the remaining x days, for x≤1
D could be collected as a combination of
– D as collected in 2006
– A sampled subset of pings from a ping server over the period that D is collected
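The base/test division described above can be sketched as a simple time-based split. This is a minimal illustration, not the track's actual tooling: the `split_by_time` function, the `(permalink, timestamp)` record shape and the sample permalinks are all assumptions, and the collection end is taken to be the latest timestamp seen.

```python
from datetime import datetime, timedelta


def split_by_time(posts, x_days):
    """Split (permalink, timestamp) pairs into a base set and a test set.

    The test set holds posts from the final x_days of the collection;
    everything earlier goes into the base set.
    """
    if not posts:
        return [], []
    collection_end = max(ts for _, ts in posts)
    cutoff = collection_end - timedelta(days=x_days)
    base = [p for p in posts if p[1] <= cutoff]
    test = [p for p in posts if p[1] > cutoff]
    return base, test


# Hypothetical five-day collection with one post per day at noon.
posts = [(f"post-{day}", datetime(2006, 11, day, 12, 0)) for day in range(1, 6)]
base, test = split_by_time(posts, x_days=1)
print(len(base), "base posts;", len(test), "test posts")
```

Because `timedelta` accepts fractional days, the same function also covers a test window shorter than one day.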

Task Details - Assessment
Assessors classify each spam post into one or more classes, based on the kind of spam that the post, or the blog hosting it, features
– Non-blog
– Keyword-stuffed
– Post-stitching
– Post-plagiarism
– Post-weaving
– Blog/link-roll spam
Each assessment typically takes 1-2 minutes
Detailed assessment will enable participants to identify classes they handle well and where they can improve

Non-Blog
Ping at weblogs.com
No RSS feeds
No dated entry, no comments
Possibly plagiarized content

Keyword Stuffed Blog
‘coupon codes’, ‘casino’

Post Stitching
Excerpts scraped from other sources

Post Weaving
Spam links contextually placed in post

Link-roll spam
With fully plagiarized text

Evaluation
D_base distributed first, D_test subsequently with 50 independent sets of permalinks
The D_base, D_test division will mimic how blog search engines operate
– Build models to detect splogs, using individual posts, feeds or blog homepages of what is seen
– Detect spam in an incoming stream of new blog postings
Teams will be judged by how well they detect “spamminess” for new posts

Input/Output
Input: an individual set of test input. 1 or y such sets can be used, with each set biased to a specific splog genre, blog publishing host or TLD. Each permalink is to be judged by participants.
Output format: {set Q0 docno rank prob runtag}
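To make the output format concrete, the sketch below writes one line per permalink with the six fields named on the slide. Several details are assumptions rather than part of the proposal: the `format_run` helper, the whitespace separator, the four-decimal probability, ranking by descending spam probability, and the sample document identifiers and run tag.

```python
def format_run(set_id, scores, runtag):
    """Return output lines of the form: set Q0 docno rank prob runtag.

    scores maps a permalink identifier (docno) to an estimated spam
    probability. Ranking by descending probability is an assumption.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        f"{set_id} Q0 {docno} {rank} {prob:.4f} {runtag}"
        for rank, (docno, prob) in enumerate(ranked, start=1)
    ]


# Hypothetical scores for three permalinks in test set 1.
scores = {"post-a": 0.12, "post-b": 0.93, "post-c": 0.55}
lines = format_run(1, scores, "demoRun")
for line in lines:
    print(line)
```

Keeping the probability in the output alongside the rank would let assessors apply different decision thresholds without asking participants to resubmit.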

Summary
Spam blogs present a major challenge to the quality of blog mining/analytics
Splog detection is different from spam detection in other communication platforms
Development of a TREC task will help further the state of the art
Task requirements can be easily aligned with the existing task of opinion identification