Connecting Users across Social Media Sites: A Behavioral-Modeling Approach Reza Zafarani and Huan Liu KDD’13 Presenter: Changqing Luo, Zhihao Cao, and.

Connecting Users across Social Media Sites: A Behavioral-Modeling Approach Reza Zafarani and Huan Liu KDD’13 Presenter: Changqing Luo, Zhihao Cao, and Junqi Ma

Problem:
- The New York Times reported that Skout, a mobile social networking app, discovered that, within two weeks, three adults had masqueraded as 13- to 17-year-olds. In three separate incidents, they contacted children and, the police say, assaulted them.
- Age verification is important for detecting such age inconsistencies, but it is hard on social media, which offers a degree of anonymity. One way to detect the inconsistency is to start connecting the different identities of a user across social media sites.

Related Work:
- Identifying Content Authorship
  - Reference [1] proposes a method for detecting pages created by the same individual across different collections of documents.
  - Technique: Normalized Compression Distance (NCD)
  - However, such methods commonly assume large collections of documents, whereas for usernames the available information is limited to one word.

Related Work:
- User Identification on One Site
  - Reference [2] presents a deanonymization technique for identifying users on a single media site. Individuals in an anonymized network can be identified either by manipulating the network before it is anonymized or by having prior knowledge about certain anonymized nodes. Given even a little information about an individual, one can easily identify that individual’s record in the dataset.
  - However, we need a method that can handle user identification over multiple sites. Moreover, it should avoid using link information, which is not always available on different social media sites.

Proposed Solution:
- Unique behaviors due to environment, personality, or even human limitations can create redundant information across social media sites.
- Thus, this paper proposes a method called MOBIUS, which exploits such redundancies to identify users across social media sites by considering behavioral patterns and features.

Problem Statement
- Connectivity among user identities across different sites is often unavailable.
  1. People can freely choose their usernames.
  2. Different websites employ different user-naming and authentication systems.
- Other profile attributes, such as gender, location, interests, profile pictures, and language, should help identify individuals better, but the available information is not consistent across social media sites.
- Usernames are atomic entities and the minimum common factor available on all social media sites. Therefore, this paper analyzes the behavioral patterns and features from usernames.

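Problem Statement
- Definition: Given a set of n usernames (prior usernames) U = {u1, u2, ..., un} owned by individual I and a candidate username c, a user identification procedure attempts to learn an identification function f(., .) such that:

  $$f(U, c) = \begin{cases} 1 & \text{if } c \text{ and the set } U \text{ belong to } I, \\ 0 & \text{otherwise.} \end{cases}$$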

MOBIUS: Modeling Behavior for Identifying Users across Sites
- When individuals select usernames, they exhibit certain behavioral patterns. This often leads to information redundancy, which helps in learning the identification function.
- The identification function can be learned with a supervised learning framework that uses these features and prior information (labeled data).
- Depending on the learning framework, one can even learn the probability that an individual owns the candidate username, generalizing the binary function f to a probabilistic model (f(U, c) = p). This probability can help select the most likely individual who owns the candidate username.

MOBIUS: Modeling Behavior for Identifying Users across Sites

Behavioral patterns and features

Patterns due to Human Limitations

Limitations in Time and Memory
- Selecting the same usernames:
  - Reference [19] shows that 59% of individuals prefer to use the same username repeatedly.
  - The number of times the candidate username c is repeated in the prior usernames is used as a feature.
- Username Length Likelihood:
  - Users commonly have a limited set of potential usernames from which they select one when asked to create a new username.
  - These usernames have different lengths and, as a result, a length distribution.
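
The two features on this slide can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and summarizing the length distribution by mean, standard deviation, median, minimum, and maximum is an assumption, since the slide does not show which statistics are used.

```python
from statistics import mean, median, pstdev


def repetition_count(candidate, priors):
    """Number of prior usernames that are identical to the candidate."""
    return sum(1 for u in priors if u == candidate)


def length_features(candidate, priors):
    """Candidate length plus summary statistics of the prior-username lengths.

    The choice of statistics is an assumption made for this sketch.
    """
    lengths = [len(u) for u in priors]
    return {
        "candidate_len": len(candidate),
        "mean_len": mean(lengths),
        "std_len": pstdev(lengths),
        "median_len": median(lengths),
        "min_len": min(lengths),
        "max_len": max(lengths),
    }
```

For instance, with prior usernames `["mark.brown", "mbrown", "mark.brown"]`, the candidate `"mark.brown"` is repeated twice, and its length of 10 can be compared against the prior lengths 10, 6, and 10.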

Limitations in Time and Memory
- Unique Username Creation Likelihood:
  - Users often prefer not to create new usernames.
  - The effort that users are willing to put into creating new usernames can be approximately represented by the number of unique usernames, uniq(U), among the prior usernames U.
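
One way to turn this into a numeric feature is to divide the number of unique prior usernames by the total number of prior usernames. The normalization and the function name `uniqueness` are assumptions of this sketch; the slide's own formula is not shown in the transcript.

```python
def uniqueness(priors):
    """Fraction of prior usernames that are distinct: |uniq(U)| / |U|.

    Values near 1 suggest the user creates a new username on most sites;
    values near 0 suggest heavy reuse.
    """
    if not priors:
        raise ValueError("need at least one prior username")
    return len(set(priors)) / len(priors)
```

For example, `uniqueness(["mark", "mark", "mbrown", "mark"])` is 0.5, because only two of the four prior usernames are distinct.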

Knowledge Limitation
- Limited Vocabulary:
  - An individual’s vocabulary size in a language is considered a feature for identifying them.
- Limited Alphabet:
  - The alphabet letters used in usernames are highly dependent on language.
  - The number of alphabet letters used is considered a feature, both for the candidate username and for the prior usernames.

Exogenous factors
- Cultural influences or the environment that the user lives in

Typing Patterns
- The keyboard is considered a general constraint imposed by the environment.
- Reference [4] shows that the layout of the keyboard significantly affects how random usernames are selected.

Typing Patterns
- To capture keyboard-related regularities, this paper constructs 15 features for each keyboard layout.
  - (1 feature): The percentage of keys typed using the same hand used for the previous key.
  - (1 feature): The percentage of keys typed using the same finger used for the previous key.
  - (8 features): The percentage of keys typed using each finger.
  - (4 features): The percentage of keys pressed on each row: top row, home row, bottom row, and number row.
  - (1 feature): The approximate distance traveled for typing a username.
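
A few of these features can be sketched for a QWERTY layout. The sketch below covers only the same-hand feature and the four row features; the finger and travel-distance features are omitted. The hand assignment follows standard touch typing, and the names `ROWS`, `LEFT_HAND`, and `keyboard_features`, the handling of characters outside the four rows (they are ignored), and the use of key-to-key transitions as the denominator for the same-hand feature are all assumptions.

```python
# QWERTY rows (letters and digits only); other characters are ignored here.
ROWS = {
    "number": "1234567890",
    "top": "qwertyuiop",
    "home": "asdfghjkl",
    "bottom": "zxcvbnm",
}
# Keys typed with the left hand under standard touch typing.
LEFT_HAND = set("12345qwertasdfgzxcvb")


def keyboard_features(username):
    """Fractions of keys per row and of same-hand consecutive key pairs."""
    keys = [ch for ch in username.lower()
            if any(ch in row for row in ROWS.values())]
    feats = {f"{name}_row": 0.0 for name in ROWS}
    feats["same_hand"] = 0.0
    if not keys:
        return feats
    n = len(keys)
    for name, row in ROWS.items():
        feats[f"{name}_row"] = sum(ch in row for ch in keys) / n
    hands = [ch in LEFT_HAND for ch in keys]  # True = left, False = right
    if n > 1:
        same = sum(a == b for a, b in zip(hands, hands[1:]))
        feats["same_hand"] = same / (n - 1)
    return feats
```

A username such as `"asdf"` sits entirely on the home row and is typed with one hand, which is the kind of keyboard-driven regularity these features are meant to expose.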

Language Patterns
- Language is one of the cultural priors.
- Users often use the same language or the same set of languages when selecting usernames.
- In MOBIUS, the language of the username is considered a feature in the dataset.
- To detect the language, this paper trains an n-gram statistical language detector over the European Parliament Proceedings Parallel Corpus and obtains the language distribution.

Endogenous factor - Habits
- “Old habits die hard”
- Habits in creating usernames include:
  - Username modification
  - Generating similar usernames
  - Username observation likelihood

Username modification
- Individuals often select new usernames by changing their previous usernames:
  - Adding prefixes or suffixes, e.g., mark.brown --> mark.brown2008
  - Abbreviating their usernames, e.g., ivan.sears --> isears
  - Changing or adding characters, e.g., beth.smith --> b3th.smith
- How to capture the modification?
  - To detect an added prefix or suffix: check whether one username is a substring of the other.
  - To detect abbreviation: compute the Longest Common Subsequence length pairwise between the candidate username and the prior usernames.
  - To detect swapped and added letters: use normalized and unnormalized versions of both edit distance and dynamic time warping distance.
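
The three checks can be sketched with standard dynamic programming. The function names are hypothetical, normalizing the edit distance by the longer username's length is an assumption, and dynamic time warping is left out for brevity.

```python
def is_affix_variant(a, b):
    """True if one username is a substring of the other (added prefix/suffix)."""
    return a in b or b in a


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    return edit_distance(a, b) / max(len(a), len(b), 1)
```

On the slide's examples, `mark.brown` is a substring of `mark.brown2008`, `isears` is a full subsequence of `ivan.sears`, and `beth.smith` and `b3th.smith` differ by a single substitution.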

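Generating similar usernames
- Users tend to generate similar usernames.
- It is hard to capture the similarity between such usernames using the previously discussed methods, for example, gateman and nametag.
- This paper compares the candidate username and the prior usernames using the Jensen-Shannon divergence (JS):

  $$JS(P \,\|\, Q) = \tfrac{1}{2}\left[ KL(P \,\|\, M) + KL(Q \,\|\, M) \right], \qquad M = \tfrac{1}{2}(P + Q),$$

  where $KL(P \,\|\, Q) = \sum_i P_i \log \frac{P_i}{Q_i}$ is the Kullback-Leibler divergence, and P and Q are the alphabet distributions of the candidate username and the prior usernames.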
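
A minimal sketch of this comparison follows, using the standard definition JS(P‖Q) = ½·KL(P‖M) + ½·KL(Q‖M) with M = ½(P + Q). Treating P and Q as character-frequency distributions, using base-2 logarithms, and the function names are assumptions of the sketch; the username lists are assumed to be non-empty.

```python
from collections import Counter
from math import log2


def char_distribution(usernames):
    """Relative frequency of each character across the given usernames."""
    counts = Counter(ch for u in usernames for ch in u)
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}


def kl(p, q):
    """KL(p || q); q must be positive wherever p is positive."""
    return sum(pv * log2(pv / q[ch]) for ch, pv in p.items() if pv > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two character distributions."""
    support = set(p) | set(q)
    m = {ch: 0.5 * (p.get(ch, 0.0) + q.get(ch, 0.0)) for ch in support}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Because gateman and nametag use exactly the same letters with the same frequencies, their divergence is zero, even though substring, subsequence, and edit-distance checks see them as quite different.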

Username Observation Likelihood
- Given the prior knowledge of usernames, we can estimate the probability of observing the candidate username.
- The probability of observing username u, denoted in characters as u = c1 c2 ... cn, is

  $$p(u) = \prod_{i=1}^{n} p(c_i \mid c_1 c_2 \ldots c_{i-1}).$$
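
In practice the conditional probabilities in such a product are usually approximated with a character n-gram model. The sketch below uses a bigram model with add-one smoothing, trained on the prior usernames. The model order, the smoothing scheme, the start/end markers, and all names are assumptions; the transcript does not say which n-gram order or smoothing the paper uses.

```python
from collections import Counter
from math import log

START, END = "^", "$"  # hypothetical boundary markers


def train_bigram(priors):
    """Count character bigrams over the prior usernames."""
    pair_counts = Counter()
    context_counts = Counter()
    vocab = {END}
    for u in priors:
        chars = [START] + list(u) + [END]
        vocab.update(u)
        for prev, cur in zip(chars, chars[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts, vocab


def log_likelihood(candidate, model):
    """Log-probability of the candidate under the smoothed bigram model."""
    pair_counts, context_counts, vocab = model
    v = len(vocab) + 1  # +1 reserves probability mass for unseen characters
    chars = [START] + list(candidate) + [END]
    return sum(
        log((pair_counts[(prev, cur)] + 1) / (context_counts[prev] + v))
        for prev, cur in zip(chars, chars[1:])
    )
```

A candidate built from character sequences the user has typed before scores higher than an unrelated string of the same length. Raw log-likelihoods fall as usernames get longer, so comparisons are most meaningful between candidates of similar length.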

EXPERIMENTS - Data Preparation
- Social Networking Sites: On most social networking sites, such as Google+ or Facebook, users can list their IDs on other sites.
- Blogging and Blog Advertisement Portals: Individuals often join blog cataloging sites to list not only their blogs but also their profiles on other sites.
- Forums: Many forums use generic Content Management Systems. These applications usually allow users to add their usernames on social media sites to their profiles.
- Overall, 100,179 (c-U) pairs are collected from 32 sites as positive instances, where c is a username and U is the set of prior usernames.
- Negative instances are constructed by randomly creating pairs (ci-Uj) such that ci is from one positive instance and Uj is from a different positive instance (i != j), to guarantee that they are not from the same individual.
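
The negative-instance construction can be sketched as below. Producing exactly one negative per positive, the fixed random seed, and the function name are assumptions of this sketch. As on the slide, it also assumes that different positive instances belong to different individuals.

```python
import random


def make_negatives(positives, seed=0):
    """Pair each candidate c_i with the prior set U_j of another instance.

    positives: list of (candidate, prior_usernames) pairs.
    Returns one (c_i, U_j) pair per positive instance, with j != i.
    """
    n = len(positives)
    if n < 2:
        raise ValueError("need at least two positive instances")
    rng = random.Random(seed)
    negatives = []
    for i, (c_i, _) in enumerate(positives):
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1  # shift past i so that j is uniform over all indices != i
        negatives.append((c_i, positives[j][1]))
    return negatives
```

Drawing `j` from `n - 1` values and shifting past `i` avoids a rejection loop while still sampling uniformly from the other instances.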

EXPERIMENTS - compare MOBIUS with other methods
- Naive Bayes is used as the probabilistic classifier.
- MOBIUS is compared against two methods from earlier papers and three baselines.
- MOBIUS performs best.

EXPERIMENTS - compare learning algorithms
- Results are not significantly different among classification methods.
- This shows that the result is not sensitive to the learning algorithm when sufficient information is available in the features.

EXPERIMENTS - feature importance analysis
- Classification using only the top ten features provides an accuracy of 92.72%.
- In the ranked features, numbers rank higher on average than English alphabet letters, and non-English alphabet letters or special characters have higher odds ratios on average.
- [Table: top ten important features]

EXPERIMENTS - diminishing returns for adding more usernames and more features
- The identification accuracy shows a monotonically increasing trend.
- Even for a single prior username, the identification is 90.72% accurate.

EXPERIMENTS - diminishing returns for adding more usernames and more features
- A power function fits the curve.
- The improvement becomes marginal as the number of usernames increases and is negligible for more than 7 usernames.
- The next step is to analyze how adding features correlates with adding prior usernames.
- Again, a power function fits the curve.

EXPERIMENTS - diminishing returns for adding more usernames and more features
- Let f(n, k) denote the performance of the method for n usernames and k features.
- The δ function is a finite difference approximation of the ratio of the derivatives with respect to n and k.
- When δ(n, k) > 1, adding usernames improves performance more; when δ(n, k) < 1, adding features is better.
- For small values of n and k, i.e., when fewer usernames and features are available, adding features helps most; in all other cases, adding usernames is more beneficial.
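
One finite-difference form consistent with this description is δ(n, k) = [f(n+1, k) − f(n, k)] / [f(n, k+1) − f(n, k)]. The sketch below assumes that form, since the slide's own formula is not in the transcript. The function `delta` and the toy performance function in the usage note are hypothetical.

```python
def delta(f, n, k):
    """Ratio of the gain from one more username to the gain from one more feature.

    f is a callable f(n, k) returning performance for n usernames, k features.
    """
    gain_usernames = f(n + 1, k) - f(n, k)
    gain_features = f(n, k + 1) - f(n, k)
    if gain_features == 0:
        # No gain from an extra feature: the ratio is unbounded or undefined.
        return float("inf") if gain_usernames > 0 else float("nan")
    return gain_usernames / gain_features
```

With a toy performance function `f(n, k) = 2*n + k`, `delta(f, 1, 1)` is 2.0: one more username helps twice as much as one more feature, so δ > 1 favors adding usernames.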

CONCLUSIONS
- The paper demonstrates a behavioral-modeling methodology (MOBIUS) for connecting individuals across social media sites, which employs the minimal information available on all social media sites (usernames).
- This principled, behavioral modeling approach performs better than earlier methods.
- Features can be selected based on particular application needs.
- Adding more features and usernames can further improve learning performance, but with diminishing returns.

Discussions
- Duplicate names are very common among people. For example, if two people named “John” have the usernames “John2000” and “John200”, they are likely to be identified as the same person, which is wrong. If the dataset contains many people with duplicate names, this may affect the accuracy to some extent. The paper does not address this problem.
- Second, people who share the same real name may choose similar usernames on social media. The experimental results show that, given only a single prior username, the accuracy reaches about 90%. We highly doubt this result when we consider the situation in which some people have duplicate names.

Discussions
- Usernames are generally very short, so MOBIUS may not be able to capture the complete behavioral patterns and features.
- Moreover, people normally do not use many different social media sites and thus do not have many usernames. As a result, the number of usernames in the prior username set is limited, and the complete features may not be obtained.
- In addition, the paper does not provide information about the prior username set. This set is difficult to form, and if mistakes are made in forming it, the results derived from MOBIUS will be incorrect as well.

Thank you! Q & A