Published by Leonard Byrd. Modified over 8 years ago.
1
NTU Natural Language Processing Lab.
Blog Track Open Task: Spam Blog Classification
Advisor: Hsin-Hsi Chen
Speaker: Sheng-Chung Yen
Date: 2007/01/08
P. Kolari, T. Finin, A. Java and J. Mayfield
University of Maryland, Baltimore County and Johns Hopkins University Applied Physics Laboratory
2
Outline
- Introduction
- Splog Detection Problem
- Detecting Splogs
- TREC Blog Track 2006
- Splog Task Assessment
- Conclusion
4
Introduction
- Spam blogs, or splogs, are blogs created for the sole purpose of hosting ads, promoting the page rank of affiliates, and getting new content indexed.
- This open task submission:
  – details how splogs impact opinion identification;
  – proposes an approach to assessment and evaluation for a Spam Blog Classification task in 2007.
5
Splog Detection Problem
A post from a splog:
1. Displays ads in high-paying contexts.
2. Features content plagiarized from other blogs.
3. Hosts hyperlinks that create link farms.
Splog detection is a classification problem within the blogosphere subset B:
- B_A: all authentic content
- B_S: content from splogs
- B_U: blog pages for which a judgment of authenticity or spam has not yet been made
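The three sets above can be sketched as a small labelling scheme. This is only an illustration under stated assumptions, not code from the paper: the names `Judgment` and `partition` and the example URLs are hypothetical.

```python
from enum import Enum


class Judgment(Enum):
    """Assessment state of a blog (illustrative names, not from the paper)."""
    AUTHENTIC = "B_A"
    SPLOG = "B_S"
    UNJUDGED = "B_U"


def partition(blogs):
    """Split a {blog_url: Judgment} mapping into the sets B_A, B_S and B_U."""
    sets = {judgment: set() for judgment in Judgment}
    for url, judgment in blogs.items():
        sets[judgment].add(url)
    return sets


# Hypothetical toy blogosphere B.
B = {
    "http://blog-one.example": Judgment.AUTHENTIC,
    "http://blog-two.example": Judgment.SPLOG,
    "http://blog-three.example": Judgment.UNJUDGED,
}
sets = partition(B)

# Only the unjudged pages (B_U) still need a classifier decision.
to_classify = sets[Judgment.UNJUDGED]
```

Under this reading, a splog detector is a function that moves pages out of B_U into either B_A or B_S.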
6
[Figure: image with callouts 1, 2 and 3]
7
[Figure: the blogosphere B and its subsets B_A, B_S and B_U]
8
Detecting Splogs
All models are based on SVMs.
- Words (bag-of-words)
  – Ex: “I”, “We”, “my”, “what” indicate an authentic blog
- Word N-Grams
  – Ex: “comments-off”, “in-uncategorized” indicate a splog
  – Ex: “2-comments”, “1-comments”, “I have”, “to my” indicate an authentic blog
- Tokenized Anchors
  – Anchor text: the visible text of a hyperlink
  – Ex: “comment”, “flickr” indicate an authentic blog
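A minimal sketch of how word and word n-gram features could feed a linear SVM. It is not the authors' implementation: the training routine is plain subgradient descent on the regularised hinge loss, written without a bias term for brevity, and the function names, hyperparameters and toy training texts (assembled from the example n-grams above) are all assumptions.

```python
import re


def words(text):
    """Lowercased word tokens (bag-of-words features)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_ngrams(text, n=2):
    """Word n-grams joined with '-', e.g. 'comments off' -> 'comments-off'."""
    toks = words(text)
    return ["-".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]


def features(text):
    """Binary feature set: unigrams plus bigrams."""
    return set(words(text)) | set(word_ngrams(text, 2))


def train_linear_svm(samples, labels, epochs=20, eta=0.1, lam=0.01):
    """Subgradient descent on the L2-regularised hinge loss (no bias term).

    samples: list of feature sets; labels: +1 (splog) or -1 (authentic).
    """
    w = {}
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * sum(w.get(f, 0.0) for f in x)
            for f in w:                      # regularisation shrink
                w[f] *= 1.0 - eta * lam
            if margin < 1.0:                 # hinge loss is active
                for f in x:
                    w[f] = w.get(f, 0.0) + eta * y
    return w


def score(w, text):
    """Signed score: positive leans splog, negative leans authentic."""
    return sum(w.get(f, 0.0) for f in features(text))


# Toy training texts (hypothetical), labelled +1 splog / -1 authentic.
train = [
    ("comments off in uncategorized", +1),
    ("in uncategorized comments off", +1),
    ("2 comments I have", -1),
    ("1 comments to my", -1),
]
w = train_linear_svm([features(t) for t, _ in train], [y for _, y in train])
```

With this toy data, `score(w, "comments off")` comes out positive and `score(w, "I have")` negative. A real system would train on assessed blogs with an established SVM package rather than this hand-rolled loop.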
9
- Tokenized URLs
  – Links pointing to the “.info” domain indicate a splog
  – Links pointing to “flickr”, “technorati” and “feedster” indicate an authentic blog
- Global Models
  – Authentic blogs are very unlikely to link to splogs.
  – Splogs frequently link to other splogs.
- Other Techniques
  – Ping server
  – URL/IP blacklists
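The URL and link-based ideas above can be sketched as follows. Both functions are illustrative assumptions rather than the paper's method: `tokenize_url` shows one plausible way to turn a URL into tokens such as "info" or "flickr", and `splog_link_fraction` is a simple stand-in for a global model, measuring how many of a blog's outlinks point at already-known splog hosts. The host names are hypothetical.

```python
import re
from urllib.parse import urlsplit


def tokenize_url(url):
    """Split a URL's host and path into lowercase alphanumeric tokens."""
    parts = urlsplit(url)
    text = (parts.netloc + parts.path).lower()
    return [tok for tok in re.split(r"[^a-z0-9]+", text) if tok]


def splog_link_fraction(outlinks, known_splog_hosts):
    """Fraction of a blog's outlinks whose host is a known splog host."""
    if not outlinks:
        return 0.0
    hits = sum(1 for u in outlinks if urlsplit(u).hostname in known_splog_hosts)
    return hits / len(outlinks)


tokens = tokenize_url("http://www.flickr.com/photos/someone")

# Hypothetical hosts, for illustration only.
known = {"spam-one.example.info"}
links = [
    "http://spam-one.example.info/a",
    "http://www.flickr.com/photos/someone",
]
frac = splog_link_fraction(links, known)
```

The URL tokens can be added to the same feature set as words and n-grams, while a high outlink fraction reflects the observation that splogs tend to link to other splogs.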
10
TREC Blog Track 2006
- 17,969 feeds are from splogs, contributing 15.8% of the documents.
- The number of splogs present varies by query, since splogs are query dependent.
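One way to make the query dependence concrete is to measure the share of splog feeds among the top results for each query. This sketch assumes that "present" means present among the top-ranked results returned for a query. The feed ids, rankings and the function name `splog_share` are hypothetical and are not TREC data.

```python
def splog_share(ranked_feeds, splog_feeds, k=10):
    """Fraction of the top-k ranked results that come from known splog feeds."""
    top = ranked_feeds[:k]
    if not top:
        return 0.0
    return sum(1 for feed in top if feed in splog_feeds) / len(top)


# Hypothetical feed ids and rankings, for illustration only.
splog_feeds = {"f2", "f5"}
results = {
    "query A": ["f1", "f2", "f3", "f5"],
    "query B": ["f1", "f3", "f4", "f6"],
}
shares = {q: splog_share(r, splog_feeds, k=4) for q, r in results.items()}
```

In this toy setup half of the top results for "query A" are splogs and none for "query B", which is the kind of per-query variation the slide describes.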
11
[Figure: two panels titled “Cholesterol” and “Hybrid cars”]
12
Splog Task Assessment
The classification of splogs:
- Non-blog
- Keyword-stuffing
- Post-stitching
- Post-plagiarism
- Post-weaving
- Link-spam
13
Non-blog
14
Keyword-stuffing
15
Post-stitching
16
Post-plagiarism
17
Post-weaving
18
Link-spam
19
Conclusion
This paper:
- proposes a spam blog classification task for TREC Blog Track 2007
- argues why it forms an important part of blog analytics
- surveys existing techniques for eliminating splogs
- shows how splogs impacted the primary task of TREC Blog Track 2006
- puts forward assessment and evaluation for such a task to be adopted in TREC Blog Track 2007