

Automatic Detection of Social Tag Spams Using a Text Mining Approach Hsin-Chang Yang Associate Professor Department of Information Management National University of Kaohsiung

Outline: Introduction; Association Discovery by SOM; Tag Spam Detection; Experimental Results; Conclusions.

Social Bookmarking: Why? Social bookmarking services (also known as folksonomies) are gaining popularity because they offer the following benefits: less effort in Web page annotation, improved retrieval precision, and simpler Web page classification.

How does a folksonomy work? Simple: a user u_i annotates a Web page o_j with a set of tags T_ij. A folksonomy is generally represented as a set of tuples (u_i, o_j, T_ij), where u_i ∈ U, o_j ∈ O, and t_ij ∈ T for each tag t_ij in T_ij.
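The tuple representation can be sketched in a few lines of Python. This is only an illustration: the `Annotation` type, the `tags_per_page` helper, and the sample users, URLs, and tags are all hypothetical and do not come from the slides.

```python
from collections import defaultdict
from typing import FrozenSet, NamedTuple


class Annotation(NamedTuple):
    """One folksonomy tuple (u_i, o_j, T_ij)."""
    user: str
    page: str
    tags: FrozenSet[str]


# Hypothetical sample data, invented for illustration only.
folksonomy = {
    Annotation("alice", "http://example.org/som", frozenset({"som", "clustering"})),
    Annotation("bob", "http://example.org/som", frozenset({"neural-network", "som"})),
    Annotation("bob", "http://example.org/spam", frozenset({"spam", "filtering"})),
}


def tags_per_page(annotations):
    """Merge the tag sets of every user who annotated the same page."""
    merged = defaultdict(set)
    for a in annotations:
        merged[a.page] |= a.tags
    return dict(merged)


page_tags = tags_per_page(folksonomy)
```

Storing each tag set as a `frozenset` keeps the tuples hashable, so the folksonomy itself can be a set, as in the definition.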

Characteristics of Folksonomy: collaboration, semantic relatedness, and the possibility of spam.

Tag Spams: Tags that are unrelated or improperly related to the content or semantics of the annotated Web pages. They arise for advertising or promotional purposes, mislead users, and degrade retrieval results.

System Architecture (diagram): Web pages and tags are preprocessed into Web page vectors and tag vectors. SOM training on these vectors yields synaptic weight vectors. Labeling yields page clusters and tag clusters. Association discovery yields page/tag associations.

Preprocessing: A bag-of-words approach is used. Each Web page P_i is transformed into a binary vector P_i. Its tag list T_i is likewise transformed into a binary vector T_i.
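A minimal sketch of binary bag-of-words encoding follows. The tokenizer, the helper names, and the sample texts are assumptions made for illustration; the slides do not say how words were extracted or whether steps such as stop-word removal were applied.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Map each distinct word to a fixed vector position (sorted for determinism)."""
    words = sorted({w for doc in documents for w in tokenize(doc)})
    return {w: i for i, w in enumerate(words)}


def to_binary_vector(text, vocabulary):
    """Return a 0/1 vector: 1 where the vocabulary word occurs in the text."""
    vec = [0] * len(vocabulary)
    for w in tokenize(text):
        if w in vocabulary:
            vec[vocabulary[w]] = 1
    return vec


# Hypothetical page texts, invented for illustration only.
pages = ["Self-organizing maps cluster documents", "Cheap watches, cheap deals"]
vocab = build_vocabulary(pages)
page_vectors = [to_binary_vector(p, vocab) for p in pages]
```

The same functions can encode tag lists by building a second vocabulary from the tags, which matches the separate page and tag vocabularies reported in the experiments.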

SOM Training: The page vectors P_i and the tag vectors T_i were used to train two self-organizing maps separately, yielding a page map M_P and a tag map M_T.
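For readers unfamiliar with self-organizing maps, here is a small pure-Python sketch of the standard training loop. It is not the author's implementation: the linear decay schedules, the Gaussian neighborhood, the initial learning rate `lr0`, and the function name `train_som` are all assumptions.

```python
import math
import random


def train_som(vectors, rows, cols, epochs, lr0=0.5, seed=0):
    """Train a rows x cols self-organizing map on equal-length vectors.

    Returns a dict mapping each grid position (r, c) to its synaptic
    weight vector.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        # Learning rate and neighborhood radius both decay linearly (assumed schedule).
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)
        for x in vectors:
            # Best matching unit: the neuron whose weights are closest to x.
            bmu = min(weights, key=lambda n: sum(
                (w - v) ** 2 for w, v in zip(weights[n], x)))
            for node, w in weights.items():
                grid_dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                # Gaussian neighborhood: neurons near the BMU move more.
                h = math.exp(-grid_dist2 / (2.0 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights
```

Under these assumptions, the two maps would be obtained by calling the function once on the page vectors and once on the tag vectors, for example `train_som(page_vectors, 10, 10, 400)` with the map size and epoch count reported in the experiments.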

Labeling: We labeled each Web page on M_P by finding its most similar neuron. A page cluster map (PCM) was obtained after all pages had been labeled. The same approach was applied to all tag lists on M_T to obtain a tag cluster map (TCM).
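The labeling step can be sketched as a nearest-neuron lookup. The slides do not name the similarity measure, so squared Euclidean distance is assumed here; the function names and the tiny hand-set map are hypothetical.

```python
from collections import defaultdict


def best_matching_unit(vector, weights):
    """Return the grid position of the neuron closest to the vector."""
    return min(weights, key=lambda n: sum(
        (w - v) ** 2 for w, v in zip(weights[n], vector)))


def label_map(vectors, weights):
    """Build a cluster map: neuron position -> indices of vectors labeled to it."""
    cluster_map = defaultdict(list)
    for idx, vec in enumerate(vectors):
        cluster_map[best_matching_unit(vec, weights)].append(idx)
    return dict(cluster_map)


# Hypothetical 1 x 2 map with hand-set weights, for illustration only.
toy_weights = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0]}
toy_pcm = label_map([[1, 0], [0, 1], [1, 0]], toy_weights)
```

Each neuron that receives at least one vector acts as a cluster, so the returned dictionary plays the role of the PCM (or the TCM when applied to tag vectors).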

Association Discovery: Find associations between page clusters and tag clusters using a voting scheme. (Diagram: P_i is labeled on the PCM and T_i on the TCM; the association between the two clusters receives a vote, +1.)
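One plausible reading of the voting scheme is a simple co-occurrence count over training pairs, sketched below. The function names, the `min_votes` cutoff, and the sample labels are assumptions; the slides do not say how votes are turned into final associations.

```python
from collections import Counter


def vote_associations(page_labels, tag_labels):
    """Count how many training pairs fall on each (page cluster, tag cluster) pair.

    page_labels[i] and tag_labels[i] are the clusters of P_i and T_i.
    """
    votes = Counter()
    for p_cluster, t_cluster in zip(page_labels, tag_labels):
        votes[(p_cluster, t_cluster)] += 1  # the "+1" vote
    return votes


def related_pairs(votes, min_votes=1):
    """Keep the cluster pairs with at least min_votes votes (hypothetical cutoff)."""
    return {pair for pair, n in votes.items() if n >= min_votes}


# Hypothetical cluster labels for four training pairs.
page_labels = [(0, 0), (0, 0), (1, 1), (0, 0)]
tag_labels = [(2, 3), (2, 3), (0, 1), (4, 4)]
votes = vote_associations(page_labels, tag_labels)
```

A `Counter` keeps the vote table sparse, which suits two 10 × 10 maps where most of the 10,000 possible cluster pairs never co-occur.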

Architecture of Tag Spam Detection (diagram): The incoming Web page and the incoming tag list are preprocessed into an incoming page vector and an incoming tag vector. These are labeled on the PCM and TCM to give a labeled page cluster and a labeled tag cluster. Spam detection then uses the page/tag associations to identify tag spams.

Spam Detection: There are two types of tag spam detection. In document-scope detection (post-level detection), the whole tag list is identified as spam. In tag-scope detection (tag-level detection), individual tags are identified as spam. Let P_I and T_I be the incoming Web page and its tag list, respectively, and let them be labeled to a page cluster and a tag cluster, respectively.

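Document-Scope Detection: The relatedness between the labeled page cluster and the labeled tag cluster is computed from the following terms [relatedness formula missing in transcript]. Q is the neighborhood of a cluster [cluster symbol missing in transcript]. A = [a_ij] is the correlation matrix between the PCM and the TCM, where a_pk = 1 if page cluster p and tag cluster k are related and a_pk = 0 otherwise. D is the geometric distance between two clusters. T_I is identified as spam if [condition missing in transcript].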

Tag-Scope Detection: A tag is spam if it is inconsistent with the other tags in the same tag cluster. Let T_i = {t_ij} be a tag list and let W be a tag set [definition of W missing in transcript]. An incoming tag t_Ij ∈ T_I is spam if t_Ij ∉ W.
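The membership test can be sketched as follows. Because the definition of W did not survive in the transcript, this sketch assumes W is the union of all tags from the training tag lists labeled to the same tag cluster as the incoming list; that assumption, the function names, and the sample tags are not from the slides.

```python
def cluster_tag_set(cluster_tag_lists):
    """Union of all tags in the tag lists labeled to one tag cluster (assumed W)."""
    w = set()
    for tags in cluster_tag_lists:
        w |= set(tags)
    return w


def tag_scope_spam(incoming_tags, w):
    """Return the incoming tags that are not in W, keeping their original order."""
    return [t for t in incoming_tags if t not in w]


# Hypothetical tag lists labeled to the same tag cluster as the incoming list.
W = cluster_tag_set([["som", "clustering"], ["neural-network", "som"]])
flagged = tag_scope_spam(["som", "cheap-watches", "clustering"], W)
```

Under this reading, a tag is flagged only because it never appeared in the cluster during training, so rare but legitimate tags would also be flagged.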

Experimental Result, Dataset: 1500 Web page / tag list pairs were collected [source missing in transcript]; each pair was inspected manually at both the post level and the tag level. The pairs cover 583 distinct Web pages. Vocabulary sizes: Web pages: [value missing in transcript]; tag lists: 5157. Average number of tags per page: 4.7.

Experimental Result, Parameters: Map sizes: PCM 10 × 10, TCM 10 × 10. Training epochs: PCM 400, TCM 200. [Parameter symbol missing in transcript]: 0.7.

Experimental Result: Number of training / test data: 1000 / 500. Confusion matrix for document-scope detection:

                      Actual spam   Actual non-spam
Predicted spam            118              65
Predicted non-spam         44             273

Accuracy = (118 + 273) / 500 = 78.2%
Recall = 118 / (118 + 44) = 72.8%
Precision = 118 / (118 + 65) = 64.5%
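The reported figures can be checked from the four confusion-matrix cells: 118 spam lists predicted as spam, 65 non-spam lists predicted as spam, 44 spam lists predicted as non-spam, and 273 non-spam lists predicted as non-spam. The function name below is hypothetical; the formulas are the standard definitions for the spam class.

```python
def detection_metrics(tp, fp, fn, tn):
    """Return (accuracy, recall, precision) for the spam class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return accuracy, recall, precision


# Cell values of the document-scope confusion matrix on the 500 test pairs.
accuracy, recall, precision = detection_metrics(tp=118, fp=65, fn=44, tn=273)
```

Rounded to one decimal place, the three values reproduce the 78.2%, 72.8%, and 64.5% stated on the slide.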

Further Result of Document-Scope Detection: Result after 10-fold cross validation, on 500 test pairs. Accuracy = 78.8%, recall = 73.8%, precision = 66.4%. [Confusion matrix cell values missing in transcript]

Further Result of Tag-Scope Detection: Result after 10-fold cross validation. Confusion matrix (values are average numbers of tags per page):

                      Actual spam   Actual non-spam
Predicted spam            1.4             0.7
Predicted non-spam        0.4             2.2

Accuracy = (1.4 + 2.2) / 4.7 = 76.6%
Recall = 1.4 / (1.4 + 0.4) = 77.8%
Precision = 1.4 / (1.4 + 0.7) = 66.7%

Conclusions: We presented a novel scheme for tag spam detection based on text mining. Relatedness between Web pages and tags was discovered using self-organizing maps. The scheme uses only the content of Web pages rather than user behavior.

Thanks for your attention.