Date: 2015/11/19 Author: Reza Zafarani, Huan Liu Source: CIKM '15

Presentation transcript:

10 Bits of Surprise: Detecting Malicious Users with Minimum Information Date: 2015/11/19 Author: Reza Zafarani, Huan Liu Source: CIKM '15 Advisor: Jia-ling Koh Speaker: LIN, CI-JIE

Outline: Introduction, Method, Experiment, Conclusion

Introduction: Malicious users are a threat to many sites, and defending against them demands innovative countermeasures. In June 2012, Facebook reported that 8.7% of its user accounts are fake. Twitter claims that 5% of its users are fake.

Challenges: Malicious users need to be detected from their often limited content or link (i.e., friends) information. Existing techniques assume that a good amount of information about malicious users has already been gathered.

Introduction: The goal is to develop a methodology that identifies malicious users from limited information, using usernames as the minimum information available on all sites.

Malicious Users are Complex and Diverse: Malicious users often generate (1) complex and (2) diverse information to ensure their anonymity.

Complexity: Information surprise. For a rare username u with a small observation probability p(u), the information surprise I(u) is much higher than that of a common username. The probability of a username is estimated using an n-gram model.
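
The slide does not spell out the formula, so the following is a minimal sketch that assumes the standard definition of self-information, I(u) = -log2 p(u), measured in bits. The character bigram model, the add-one smoothing, the `^`/`$` padding symbols, and the function names `train_char_ngrams` and `surprise_bits` are all illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def train_char_ngrams(usernames, n=2):
    """Count character n-grams and their (n-1)-character contexts."""
    ngrams, contexts, alphabet = Counter(), Counter(), set()
    for name in usernames:
        # Assumed padding: '^' marks the start, '$' marks the end.
        padded = "^" * (n - 1) + name + "$"
        alphabet.update(padded)
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts, len(alphabet)

def surprise_bits(username, model, n=2):
    """Return -log2 p(u), with p(u) from an add-one-smoothed n-gram model."""
    ngrams, contexts, vocab = model
    padded = "^" * (n - 1) + username + "$"
    bits = 0.0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        # Add-one smoothing (an assumption) keeps unseen n-grams above zero.
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + vocab)
        bits -= math.log2(p)
    return bits

# Toy training set; a real model would be trained on a large username corpus.
model = train_char_ngrams(["john", "johnny", "joan", "jon"])
common = surprise_bits("john", model)
rare = surprise_bits("xq7z", model)
```

With this toy model, a username built from frequently seen bigrams receives fewer bits of surprise than one built from unseen bigrams, which is the behavior the slide describes.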

Diversity: (1) The number of digits in the username. (2) The proportion of digits in the username.
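
Both diversity features are simple counts. A minimal sketch, assuming only ASCII digits are counted and using the hypothetical helper name `digit_features`:

```python
import string

def digit_features(username):
    """Return (number of digits, proportion of digits) for a username."""
    digits = sum(ch in string.digits for ch in username)
    # Guard against empty input to avoid division by zero.
    proportion = digits / len(username) if username else 0.0
    return digits, proportion

count, share = digit_features("abc123")
```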

Malicious Users are Demographically Biased. Malicious User Gender: decompose a username into character n-grams, estimate the gender likelihood based on these n-grams, and use the classifier's confidence in the predicted gender as the feature. Malicious User Language: detect the language of the username by training an n-gram statistical language detector over the European Parliament Proceedings Parallel Corpus, which consists of text in 21 European languages. Also listed: the alphabet distribution of the username.

Malicious Users are Demographically Biased. Malicious User Knowledge: the vocabulary size can be computed by counting the number of words in a large dictionary that are substrings of the username.
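
The substring count described above can be sketched as follows. The case-insensitive matching, the tiny `toy_dictionary`, and the function name `vocabulary_size` are assumptions made for illustration; the slide only says that a large dictionary is used.

```python
def vocabulary_size(username, dictionary):
    """Count dictionary words that occur as substrings of the username."""
    lowered = username.lower()
    # Skip empty strings, which would match every username.
    return sum(1 for word in dictionary if word and word in lowered)

# Hypothetical stand-in for a large dictionary of lowercase words.
toy_dictionary = {"free", "money", "cat", "fre"}
size = vocabulary_size("FreeMoney2015", toy_dictionary)
```

A linear scan over the dictionary is fine for a sketch; for millions of usernames, a multi-pattern matcher such as Aho-Corasick would be a more practical choice.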

Malicious Users are Anonymous: (1) the entropy of the alphabet distribution of the username; (2) the normalized entropy; (3) the number of unique letters used in the username divided by the username length.
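
A minimal sketch of these three features, using Shannon entropy over the characters of the username. The slide does not say how the entropy is normalized; dividing by log2 of the username length (the maximum possible when every character is distinct) is an assumption here, as is the function name `anonymity_features`.

```python
import math
from collections import Counter

def anonymity_features(username):
    """Return (entropy in bits, normalized entropy, unique-character ratio)."""
    n = len(username)
    if n == 0:
        return 0.0, 0.0, 0.0
    counts = Counter(username)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Assumed normalization: divide by the maximum entropy for this length.
    max_entropy = math.log2(n)
    normalized = entropy / max_entropy if max_entropy > 0 else 0.0
    unique_ratio = len(counts) / n
    return entropy, normalized, unique_ratio

entropy, normalized, unique_ratio = anonymity_features("abcd")
```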

Malicious Users are Similar (Language Patterns): (1) the normalized character-level bigrams of usernames; (2) the number of digits at the beginning of the username; (3) the maximum number of times a letter has been repeated in the username.
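
The three language-pattern features could be computed along these lines. Two readings are assumed and may differ from the paper: bigram counts are normalized by the total number of bigrams in the username, and "repeated" is read as the total number of occurrences of the most frequent letter (not the longest consecutive run). All function names are hypothetical.

```python
from collections import Counter

def normalized_bigrams(username):
    """Map each character bigram to its relative frequency in the username."""
    bigrams = Counter(username[i:i + 2] for i in range(len(username) - 1))
    total = sum(bigrams.values())
    return {bg: c / total for bg, c in bigrams.items()} if total else {}

def leading_digits(username):
    """Count the digits that appear before the first non-digit character."""
    count = 0
    for ch in username:
        if not ch.isdigit():
            break
        count += 1
    return count

def max_letter_repeat(username):
    """Return the occurrence count of the most frequent letter (assumed reading)."""
    letters = Counter(ch for ch in username.lower() if ch.isalpha())
    return max(letters.values(), default=0)

bigram_freqs = normalized_bigrams("abab")
```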

Malicious Users are Similar (Word Patterns): two dictionaries are used, one containing keywords related to malicious activities and the other containing offensive keywords. The feature counts the number of words in each dictionary that appear as a substring of the username.

Malicious Users are Efficient: (1) the username length; (2) the number of unique alphabet letters in the username.

Malicious Users are Efficient (DVORAK and QWERTY keyboards): (1) the percentage of keys typed using the same hand that was used for the previous key; (2) the percentage of keys typed using the same finger that was used for the previous key; (3) the percentage of keys typed using each finger; (4) the percentage of keys pressed on each row: Top Row, Home Row, Bottom Row, and Number Row; (5) the approximate distance (in meters) traveled for typing a username.
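
As an illustration of the keyboard features, the sketch below computes the same-hand share and the per-row shares for a QWERTY layout only. The hand assignment follows conventional touch typing, which is an assumption; the slides do not give the mapping. Values are returned as fractions between 0 and 1 instead of percentages, the same-hand share is taken over key transitions, and characters outside the four rows are ignored. The finger and distance features are omitted.

```python
# Assumed QWERTY rows (letters and digits only).
ROWS = {
    "number": "1234567890",
    "top": "qwertyuiop",
    "home": "asdfghjkl",
    "bottom": "zxcvbnm",
}
# Assumed touch-typing split: these keys belong to the left hand.
LEFT_HAND = set("12345qwertasdfgzxcvb")

def keyboard_features(username):
    """Return the same-hand share and per-row shares for a username."""
    keys = [ch for ch in username.lower()
            if any(ch in row for row in ROWS.values())]
    features = {"same_hand": 0.0}
    features.update({name + "_row": 0.0 for name in ROWS})
    if not keys:
        return features
    hands = ["L" if k in LEFT_HAND else "R" for k in keys]
    # Compare each key's hand with the hand used for the previous key.
    same = sum(a == b for a, b in zip(hands, hands[1:]))
    if len(keys) > 1:
        features["same_hand"] = same / (len(keys) - 1)
    for name, row in ROWS.items():
        features[name + "_row"] = sum(k in row for k in keys) / len(keys)
    return features

home_only = keyboard_features("asdf")
```

A DVORAK variant would only need different `ROWS` and `LEFT_HAND` definitions.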

Malicious user detection with minimum information: Supervised learning can be performed using classification or regression.

DataSet. Malicious Users (negative examples): malicious usernames collected from sites such as dronebl.org and ahbl.org; 32 million usernames. Normal Users (positive examples): usernames of normal users collected from Twitter; 45,953 usernames. Facebook Users (positive and negative examples): 158 million usernames; Facebook expects around 1.5% to be malicious. Gender Dataset: 4 million Facebook usernames for which we have the gender information.

Conclusion: Introduced a methodology that can identify malicious users with minimum information. Identified five general characteristics of malicious users.

Thanks for listening