NTU Natural Language Processing Lab. Blog Track Open Task: Spam Blog Classification. Advisor: Hsin-Hsi Chen. Speaker: Sheng-Chung Yen. Date: 2007/01/08.


1 Blog Track Open Task: Spam Blog Classification
P. Kolari, T. Finin, A. Java and J. Mayfield
University of Maryland Baltimore County and Johns Hopkins University Applied Physics Laboratory

2 Outline
– Introduction
– Splog Detection Problem
– Detecting Splogs
– TREC Blog Track 2006
– Splog Task Assessment
– Conclusion

4 Introduction
Spam blogs, or splogs, are blogs created for the sole purpose of hosting ads, promoting the page rank of affiliates, and getting new content indexed.
This open task submission:
– details how splogs impact opinion identification
– proposes an approach to assessment and evaluation for a Spam Blog Classification task in 2007

5 Splog Detection Problem
A post from a splog:
1. Displays ads in high-paying contexts.
2. Features content plagiarized from other blogs.
3. Hosts hyperlinks that create link farms.
Splog detection is a classification problem within the blogosphere subset B:
– B_A: all authentic content
– B_S: content from splogs
– B_U: blog pages for which a judgment of authenticity or spam has not yet been made
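The three-way split of B can be made concrete with a small sketch. The names below (`partition_blogosphere`, `judgments`) are illustrative assumptions and do not come from the paper; the only idea carried over is that each blog in B belongs to exactly one of B_A, B_S, or B_U.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def partition_blogosphere(
    blogs: Iterable[str],
    judgments: Dict[str, Optional[bool]],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split blog identifiers into (B_A, B_S, B_U).

    `judgments` is a hypothetical lookup: True means judged splog,
    False means judged authentic, None or a missing key means that
    no judgment has been made yet.
    """
    b_a: Set[str] = set()
    b_s: Set[str] = set()
    b_u: Set[str] = set()
    for blog in blogs:
        verdict = judgments.get(blog)
        if verdict is None:
            b_u.add(blog)   # not yet judged
        elif verdict:
            b_s.add(blog)   # judged spam
        else:
            b_a.add(blog)   # judged authentic
    return b_a, b_s, b_u
```

Under this reading, a splog classifier is trained on the judged sets B_A and B_S and then applied to the pages in B_U.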

6 [Figure: example splog post with callouts 1, 2 and 3]

7 [Figure: the blogosphere subset B and its parts B_A, B_S and B_U]

8 Detecting Splogs
All models are based on SVMs.
Words (bag-of-words)
– Ex: “I”, “We”, “my”, “what” → authentic blog
Word N-Gram
– Ex: “comments-off”, “in-uncategorized” → splog
– Ex: “2-comments”, “1-comments”, “I have”, “to my” → authentic blog
Tokenized Anchors
– Anchor text: the visible text of a hyperlink
– Ex: “comment”, “flickr” → authentic blog
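The three text-based feature families on this slide could be extracted roughly as follows. This is a minimal sketch using only the Python standard library. The tokenizer, the lowercasing, the hyphen used to join n-gram words, and all function names are assumptions made for illustration, not the paper's actual pipeline. The resulting counts would then be turned into feature vectors for an SVM.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from typing import List

TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def bag_of_words(text: str) -> Counter:
    """Count single words (bag-of-words features)."""
    return Counter(tokenize(text))


def word_ngrams(text: str, n: int = 2) -> Counter:
    """Count runs of n adjacent words, joined with '-' (e.g. 'comments-off')."""
    tokens = tokenize(text)
    return Counter(
        "-".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


class AnchorTextParser(HTMLParser):
    """Collects tokens that appear inside <a>...</a> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0
        self.anchor_tokens: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.anchor_tokens.extend(tokenize(data))


def tokenized_anchors(html: str) -> Counter:
    """Count tokens taken from the anchor text of every link in the page."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return Counter(parser.anchor_tokens)
```

For example, `word_ngrams("Comments off in Uncategorized")` yields the bigrams `comments-off`, `off-in` and `in-uncategorized`, two of which match the splog indicators listed on the slide.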

9 Tokenized URLs
– Point to the “.info” domain → splog
– Point to “flickr”, “technorati” and “feedster” → authentic blog
Global Models
– Authentic blogs are very unlikely to link to splogs.
– Splogs frequently link to other splogs.
Other Techniques
– Ping server
– URL/IP blacklists
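A rough sketch of two of these signals follows. `tokenize_url` splits a link target into tokens such as `info` or `flickr`. `splog_link_fraction` is a deliberately simple stand-in for the link-based intuition behind the global models: it measures the share of a blog's outlinks that point to already-known splogs. Both function names and the scoring rule are assumptions for illustration, and the paper's global models are not reproduced here.

```python
import re
from typing import Iterable, List, Set
from urllib.parse import urlsplit


def tokenize_url(url: str) -> List[str]:
    """Split the host and path of a URL into lowercase alphanumeric tokens."""
    parts = urlsplit(url)
    raw = f"{parts.netloc} {parts.path}"
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", raw)]


def splog_link_fraction(outlinks: Iterable[str], known_splogs: Set[str]) -> float:
    """Return the fraction of outlinks whose target is a known splog.

    A blog with no outlinks gets 0.0. The identifiers are compared as-is,
    so callers are assumed to have normalized them beforehand.
    """
    links = list(outlinks)
    if not links:
        return 0.0
    hits = sum(1 for link in links if link in known_splogs)
    return hits / len(links)
```

With these definitions, `tokenize_url("http://cheap-pills.info/buy")` produces `['cheap', 'pills', 'info', 'buy']` (a made-up URL), so the `info` token is available as a feature. A high `splog_link_fraction` is consistent with the observation that splogs frequently link to other splogs.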

10 TREC Blog Track 2006
– 17,969 feeds are from splogs, contributing 15.8% of the documents.
– The number of splogs present varies, since splogs are query dependent.

11 [Figure: example queries “cholesterol” and “hybrid cars”]

12 Splog Task Assessment
The classification of splogs:
– Non-blog
– Keyword-stuffing
– Post-stitching
– Post-plagiarism
– Post-weaving
– Link-spam

13 Non-blog

14 Keyword-stuffing

15 Post-stitching

16 Post-plagiarism

17 Post-weaving

18 Link-spam

19 Conclusion
This paper:
– proposes a spam blog classification task for TREC Blog Track 2007
– argues why it forms an important part of blog analytics
– surveys existing techniques for eliminating splogs
– shows how splogs impacted the primary task of TREC Blog Track 2006
– puts forward assessment and evaluation for such a task to be adopted in TREC Blog Track 2007

