NTU Natural Language Processing Lab. 1 Blog Track Open Task: Spam Blog Classification Advisor: Hsin-Hsi Chen Speaker: Sheng-Chung Yen Date: 2007/01/08.

Presentation transcript:

1 Blog Track Open Task: Spam Blog Classification
P. Kolari, T. Finin, A. Java and J. Mayfield
University of Maryland Baltimore County and Johns Hopkins University Applied Physics Laboratory
NTU Natural Language Processing Lab.
Advisor: Hsin-Hsi Chen
Speaker: Sheng-Chung Yen
Date: 2007/01/08

2 Outline
– Introduction
– Splog Detection Problem
– Detecting Splogs
– TREC Blog Track 2006
– Splog Task Assessment
– Conclusion

3 Conclusion
This paper:
– proposes a spam blog classification task for TREC Blog Track 2007
– argues why such a task forms an important part of blog analytics
– surveys existing techniques for eliminating splogs
– shows how splogs affected the primary task of TREC Blog Track 2006
– puts forward assessment and evaluation for such a task, to be adopted in TREC Blog Track 2007

4 Introduction
Spam blogs, or splogs, are blogs created for the sole purpose of hosting ads, promoting the page rank of affiliates, and getting new content indexed.
This open task submission:
– details how splogs affect Opinion Identification
– proposes an approach to assessment and evaluation for a Spam Blog Classification task in 2007

5 Splog Detection Problem
A post from a splog:
1. Displays ads in high-paying contexts.
2. Features content plagiarized from other blogs.
3. Hosts hyperlinks that create link farms.
Splog detection is a classification problem within the blogosphere subset B:
– B_A: represents all authentic content
– B_S: represents content from splogs
– B_U: represents those blog pages for which a judgment of authenticity or spam has not yet been made
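The three-way split of B described above can be sketched in a few lines of Python. This is only an illustration of the set definitions, not code from the paper; the names `Label`, `partition`, and the `*.example` blog hosts are hypothetical.

```python
from enum import Enum


class Label(Enum):
    AUTHENTIC = "B_A"  # judged authentic
    SPLOG = "B_S"      # judged to be a splog
    UNJUDGED = "B_U"   # no judgment made yet


def partition(blogs, judgments):
    """Split the blog set B into B_A, B_S and B_U.

    `judgments` maps a blog to Label.AUTHENTIC or Label.SPLOG;
    any blog without an entry falls into B_U.
    """
    parts = {label: set() for label in Label}
    for blog in blogs:
        parts[judgments.get(blog, Label.UNJUDGED)].add(blog)
    return parts


# Hypothetical example data.
B = {"a.example", "b.example", "c.example"}
judged = {"a.example": Label.AUTHENTIC, "b.example": Label.SPLOG}
parts = partition(B, judged)
```

A classifier's job, in these terms, is to move pages out of B_U into B_A or B_S.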


7 [Figure: the blogosphere subset B, divided into B_A, B_S and B_U]

8 Detecting Splogs
All models are based on SVMs.
Words (bag-of-words)
– Ex: “I”, “We”, “my”, “what” → authentic blog
Word N-Grams
– Ex: “comments-off”, “in-uncategorized” → splog
– Ex: “2-comments”, “1-comments”, “I have”, “to my” → authentic blog
Tokenized Anchors
– Anchor text: the clickable text of a hyperlink
– Ex: “comment”, “flickr” → authentic blog
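As a rough illustration of the three feature types on this slide, the sketch below extracts words, hyphen-joined word n-grams, and anchor-text tokens using only the Python standard library. The tokenization rule, the hyphen join, and the function names are assumptions made for the example; the slides do not specify how the authors tokenized text, and the SVM training step itself is not shown.

```python
import re
from html.parser import HTMLParser

TOKEN = re.compile(r"[a-z0-9]+")  # assumed tokenization: lowercase alphanumerics


def words(text):
    """Bag-of-words tokens."""
    return TOKEN.findall(text.lower())


def word_ngrams(text, n=2):
    """Word n-grams, joined with '-' as in "2-comments"."""
    toks = words(text)
    return ["-".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]


class AnchorCollector(HTMLParser):
    """Collects tokens that appear inside <a>...</a> elements."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor:
            self.tokens.extend(words(data))


def anchor_tokens(html):
    """Tokenized anchor text of an HTML fragment."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens
```

Each token list would then be turned into a feature vector (for example, binary presence or counts) before being passed to an SVM.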

9 Detecting Splogs (continued)
Tokenized URLs
– Pointing to a “.info” domain → splog
– Pointing to “flickr”, “technorati” and “feedster” → authentic blog
Global Models
– Authentic blogs are very unlikely to link to splogs.
– Splogs frequently do link to other splogs.
Other Techniques
– Ping server
– URL/IP blacklists
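The URL-based and link-based signals above can be illustrated with a short Python sketch. It is not the authors' implementation: the splitting rule, the exact-host blacklist match, and the names `url_tokens`, `on_blacklist`, and `splog_link_fraction` are all assumptions chosen for the example.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL's host and path into lowercase alphanumeric tokens."""
    parts = urlsplit(url.lower())
    return [t for t in re.split(r"[^a-z0-9]+", parts.netloc + parts.path) if t]


def on_blacklist(url, blacklist):
    """True if the URL's host is in the blacklist (exact host match only)."""
    host = urlsplit(url.lower()).hostname or ""
    return host in blacklist


def splog_link_fraction(outlinks, known_splogs):
    """Fraction of a blog's outlinks that point to known splog hosts.

    Reflects the observation that splogs often link to other splogs,
    while authentic blogs rarely do.
    """
    if not outlinks:
        return 0.0
    hits = sum(1 for link in outlinks if on_blacklist(link, known_splogs))
    return hits / len(outlinks)
```

For instance, `url_tokens("http://cheap-pills.info/buy/now")` yields a token list that includes `info`, the kind of token the slide associates with splogs.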

10 TREC Blog Track
– Feeds from splogs contribute 15.8% of the documents.
– The number of splogs present varies by query, since splogs are query dependent.

11 [Figures for two examples: “Cholesterol” and “Hybrid cars”]

12 Splog Task Assessment
The classification of splogs:
– Non-blog
– Keyword-stuffing
– Post-stitching
– Post-plagiarism
– Post-weaving
– Link-spam

13 Non-blog

14 Keyword-stuffing

15 Post-stitching

16 Post-plagiarism

17 Post-weaving

18 Link-spam

19 Conclusion
This paper:
– proposes a spam blog classification task for TREC Blog Track 2007
– argues why such a task forms an important part of blog analytics
– surveys existing techniques for eliminating splogs
– shows how splogs affected the primary task of TREC Blog Track 2006
– puts forward assessment and evaluation for such a task, to be adopted in TREC Blog Track 2007