Published by Nathaniel Haynes. Modified over 8 years ago.
1
Connecting Users across Social Media Sites: A Behavioral-Modeling Approach. Reza Zafarani and Huan Liu, KDD’13. Presenters: Changqing Luo, Zhihao Cao, and Junqi Ma
2
Problem: The New York Times reported that Skout, a mobile social networking app, discovered that, within two weeks, three adults had masqueraded as 13- to 17-year-olds. In three separate incidents, they contacted children and, the police say, assaulted them. Age verification is very important for detecting such age inconsistencies, but it is hard on social media, which offers a degree of anonymity. One way to detect the inconsistency is to connect the different identities of a user across social media sites.
3
Related Work: Identifying Content Authorship. Reference [1] proposes a method, based on Normalized Compression Distance (NCD), for detecting pages created by the same individual across different collections of documents. However, such methods commonly assume large collections of documents, whereas for usernames the available information is limited to one word.
4
Related Work: User Identification on One Site. Reference [2] presents a deanonymization technique for user identification on a single social media site. Individuals in anonymized networks can be identified either by manipulating the network before it is anonymized or by having prior knowledge about certain anonymized nodes; given a little information about an individual, one can easily identify that individual’s record in the dataset. However, we need a method that handles user identification across multiple sites. Moreover, we should avoid relying on link information, which is not always available on different social media sites.
5
Proposed Solution: Unique behaviors due to environment, personality, or even human limitations can create redundant information across social media sites. This paper therefore proposes a method called MOBIUS, which exploits such redundancies to identify users across social media sites by modeling behavioral patterns and their features.
6
Problem Statement: Connectivity among user identities across different sites is often unavailable: 1. People can freely choose their usernames. 2. Different websites employ different user-naming and authentication systems. Although other profile attributes, such as gender, location, interests, profile pictures, and language, should help identify individuals, the available information is not consistent across social media sites. Usernames are atomic entities and the minimum common factor available on all social media sites. Therefore, this paper derives behavioral patterns and features from usernames.
7
Problem Statement Definition: Given a set of n usernames (prior usernames) U = {u1, u2, ..., un} owned by individual I and a candidate username c, a user identification procedure attempts to learn an identification function f(., .) such that f(U, c) = 1 if c and the set U belong to I, and f(U, c) = 0 otherwise.
8
MOBIUS: Modeling Behavior for Identifying Users across Sites When individuals select usernames, they exhibit certain behavioral patterns. This often leads to information redundancy, helping learn the identification function. We can learn an identification function by employing a supervised learning framework that utilizes these features and prior information (labeled data). Depending on the learning framework, one can even learn the probability that an individual owns the candidate username, generalizing our binary f function to a probabilistic model (f(U,c) = p). This probability can help select the most likely individual who owns the candidate username.
9
MOBIUS: Modeling Behavior for Identifying Users across Sites
10
Behavioral patterns and features
11
Patterns due to Human Limitations
12
Limitations in Time and Memory. Selecting the same usernames: Reference [19] shows that 59% of individuals prefer to use the same username repeatedly, so the number of times candidate username c is repeated in the prior usernames is used as a feature. Username Length Likelihood: Users commonly have a limited set of potential usernames from which they select one when asked to create a new username. These usernames have different lengths, and as a result the prior usernames induce a length distribution against which the length of the candidate can be compared.
13
Limitations in Time and Memory. Unique Username Creation Likelihood: Users often prefer not to create new usernames. The effort that users are willing to put into creating new usernames can be approximately represented by the number of unique usernames (uniq(U)) among the prior usernames U.
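The three features from the "Limitations in Time and Memory" slides can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the choice of summary statistics for the length distribution is an assumption, and so is expressing uniqueness as the ratio |uniq(U)| / |U|.

```python
from statistics import mean, median, pstdev


def repetition_count(candidate, priors):
    """Number of times the candidate appears verbatim among the prior usernames."""
    return sum(1 for u in priors if u == candidate)


def length_features(candidate, priors):
    """Candidate length plus summary statistics of the prior-username lengths.

    Which statistics to keep is an assumption made for this sketch.
    """
    lengths = [len(u) for u in priors]
    return {
        "candidate_len": len(candidate),
        "mean": mean(lengths),
        "std": pstdev(lengths),
        "median": median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


def uniqueness(priors):
    """Share of distinct usernames among the priors (assumed form: |uniq(U)| / |U|)."""
    return len(set(priors)) / len(priors)
```

For instance, with priors ["mark.brown", "mark.brown", "mbrown"] and candidate "mark.brown", the repetition count is 2 and the uniqueness ratio is 2/3.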
14
Knowledge Limitation. Limited vocabulary: An individual’s vocabulary size in a language is used as a feature for identifying them. Limited alphabet: The alphabet letters used in usernames are highly dependent on language, so the number of alphabet letters used is taken as a feature, both for the candidate username and for the prior usernames.
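The limited-alphabet feature amounts to counting distinct letters. The sketch below is one possible reading; the helper name is hypothetical, and folding upper and lower case together is an assumption.

```python
def alphabet_size(usernames):
    """Count the distinct alphabetic characters used across the given usernames.

    Case is folded before counting (an assumption of this sketch).
    """
    return len({ch.lower() for name in usernames for ch in name if ch.isalpha()})
```

The same function can be applied to the candidate alone, as `alphabet_size([c])`, and to the whole prior set.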
15
Exogenous factors Cultural influences or environment that the user is living in
16
Typing Patterns. The keyboard is considered a general constraint imposed by the environment. Reference [4] shows that the layout of the keyboard significantly affects how random usernames are selected.
17
Typing Patterns. To capture keyboard-related regularities, this paper constructs 15 features for each keyboard layout: (1 feature) the percentage of keys typed using the same hand as the previous key; (1 feature) the percentage of keys typed using the same finger as the previous key; (8 features) the percentage of keys typed using each finger; (4 features) the percentage of keys pressed on each row: top row, home row, bottom row, and number row; (1 feature) the approximate distance traveled in typing the username.
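Two of the keyboard feature groups listed above (same-hand transitions and row usage) can be sketched for a QWERTY layout. The hand assignment below follows conventional touch typing and is an assumption, as are the function names and the choice to ignore characters outside the four listed rows.

```python
# QWERTY rows; punctuation and shifted keys are ignored in this sketch.
ROWS = {
    "number": "1234567890",
    "top": "qwertyuiop",
    "home": "asdfghjkl",
    "bottom": "zxcvbnm",
}
# Keys conventionally typed with the left hand (assumed touch-typing split).
LEFT_HAND = set("12345qwertasdfgzxcvb")
KNOWN_KEYS = set("".join(ROWS.values()))


def typed_keys(username):
    """Lower-case the username and keep only keys covered by ROWS."""
    return [ch for ch in username.lower() if ch in KNOWN_KEYS]


def same_hand_ratio(username):
    """Fraction of key transitions where both keys are typed with the same hand."""
    keys = typed_keys(username)
    if len(keys) < 2:
        return 0.0
    same = sum((a in LEFT_HAND) == (b in LEFT_HAND) for a, b in zip(keys, keys[1:]))
    return same / (len(keys) - 1)


def row_ratios(username):
    """Fraction of keys pressed on each keyboard row."""
    keys = typed_keys(username)
    if not keys:
        return {row: 0.0 for row in ROWS}
    return {row: sum(k in chars for k in keys) / len(keys) for row, chars in ROWS.items()}
```

The per-finger and travel-distance features would need a finger map and key coordinates for each layout, which this sketch leaves out.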
18
Language Patterns. Language is one of the cultural priors. Users often use the same language, or the same set of languages, when selecting usernames. In MOBIUS, the language of the username is considered as a feature. To detect the language, the authors trained an n-gram statistical language detector over the European Parliament Proceedings Parallel Corpus and used it to obtain the language distribution.
19
Endogenous factor - Habits. “Old habits die hard.” Habits in creating usernames include: username modification, generating similar usernames, and username observation likelihood.
20
Username modification. Individuals often select new usernames by changing their previous usernames: Adding prefixes or suffixes, e.g., mark.brown --> mark.brown2008. Abbreviating their usernames, e.g., ivan.sears --> isears. Changing or adding characters, e.g., beth.smith --> b3th.smith. How are these modifications captured? Added prefixes or suffixes: check whether one username is a substring of the other. Abbreviations: compute the Longest Common Subsequence length pairwise between the candidate username and each prior username. Swapped or added letters: use normalized and unnormalized versions of both edit distance and dynamic time warping distance.
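The substring check, Longest Common Subsequence length, and edit distance named on this slide are standard string measures. The sketch below implements them with textbook dynamic programming; the function names are hypothetical, and normalizing the edit distance by the longer username's length is an assumption, since the slide does not say how normalization is done. Dynamic time warping is omitted.

```python
def is_affix_variant(a, b):
    """True if one username is a substring of the other (added prefix or suffix)."""
    return a in b or b in a


def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a, b):
    """Levenshtein distance with unit costs for insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

On the slide's own examples, "mark.brown" is a substring of "mark.brown2008", the LCS of "ivan.sears" and "isears" has length 6 (all of "isears"), and "beth.smith" and "b3th.smith" are one substitution apart.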
21
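Generating similar usernames. Users tend to generate similar usernames, but the similarity is sometimes hard to capture with the previously discussed methods, for example gateman and nametag. This paper therefore compares the candidate username and the prior usernames using the Jensen-Shannon divergence (JS): JS(P || Q) = 1/2 [KL(P || M) + KL(Q || M)], where M = 1/2 (P + Q), KL(P || Q) = Σ_i P(i) log(P(i) / Q(i)) is the Kullback-Leibler divergence, and P and Q are the alphabet distributions being compared.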
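A short sketch shows why a distribution-based measure catches the gateman/nametag case: the two usernames are anagrams, so their character distributions are identical and the Jensen-Shannon divergence is zero. The function names are hypothetical, and the base-2 logarithm is an assumption made so that the value lies between 0 and 1.

```python
from collections import Counter
from math import log2


def char_distribution(username):
    """Relative frequency of each character in the username."""
    counts = Counter(username)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def kl_divergence(p, q):
    """KL(p || q); q must be non-zero wherever p is non-zero."""
    return sum(pv * log2(pv / q[ch]) for ch, pv in p.items() if pv > 0)


def js_divergence(a, b):
    """Jensen-Shannon divergence between the character distributions of a and b."""
    p, q = char_distribution(a), char_distribution(b)
    m = {ch: 0.5 * (p.get(ch, 0.0) + q.get(ch, 0.0)) for ch in set(p) | set(q)}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Usernames with no characters in common give the maximum value of 1 under the base-2 convention.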
22
Username Observation Likelihood. Given the prior knowledge of usernames, we can estimate the probability of observing the candidate username. The probability of observing username u, denoted in characters as u = c_1 c_2 … c_n, is p(u) = Π_{i=1..n} p(c_i | c_1 … c_{i-1}).
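One way to estimate such a probability in practice is to truncate the conditioning context to the previous character. The sketch below does this with a character bigram model and add-one smoothing; both are simplifying assumptions chosen for brevity and are not taken from the slides, and all names (including the start marker) are hypothetical.

```python
from collections import Counter
from math import log

START = "^"  # hypothetical start-of-username marker


def train_bigram(usernames):
    """Count character bigrams over the given usernames."""
    pair_counts, context_counts, vocab = Counter(), Counter(), set()
    for name in usernames:
        padded = START + name
        vocab.update(name)
        for prev, curr in zip(padded, padded[1:]):
            pair_counts[(prev, curr)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts, vocab


def log_likelihood(username, model):
    """Log-probability of the username under the add-one-smoothed bigram model."""
    pair_counts, context_counts, vocab = model
    size = len(vocab) + 1  # one extra slot for characters never seen in training
    padded = START + username
    total = 0.0
    for prev, curr in zip(padded, padded[1:]):
        total += log((pair_counts[(prev, curr)] + 1) / (context_counts[prev] + size))
    return total
```

A model trained on an individual's prior usernames should then score a candidate resembling them higher than an unrelated string of the same length.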
23
EXPERIMENTS - Data Preparation. Social Networking Sites: On most social networking sites, such as Google+ or Facebook, users can list their IDs on other sites. Blogging and Blog Advertisement Portals: Individuals often join blog cataloging sites to list not only their blogs but also their profiles on other sites. Forums: Many forums use generic Content Management Systems, which usually allow users to add their usernames on social media sites to their profiles. Overall, 100,179 (c, U) pairs are collected from 32 sites as positive instances, where c is a username and U is the set of prior usernames. Negative instances are constructed by randomly creating pairs (c_i, U_j) such that c_i is from one positive instance and U_j is from a different positive instance (i != j), to guarantee that they are not from the same individual.
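The negative-instance construction can be sketched as a random re-pairing of candidates with prior sets taken from other positive instances. Generating exactly one negative per positive and the fixed seed are assumptions of this sketch; the function name is hypothetical.

```python
import random


def make_negative_instances(positives, seed=0):
    """Pair each candidate c_i with the prior set U_j of a different instance (i != j).

    `positives` is a list of (candidate, prior_usernames) tuples.
    """
    rng = random.Random(seed)
    n = len(positives)
    if n < 2:
        return []
    negatives = []
    for i, (candidate, _) in enumerate(positives):
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1  # skip index i so that j is always different from i
        negatives.append((candidate, positives[j][1]))
    return negatives
```

Drawing j from n - 1 values and shifting past i keeps the choice uniform over the other instances without rejection sampling.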
24
EXPERIMENTS - compare MOBIUS with other methods. Naive Bayes is used as the probabilistic classifier. MOBIUS is compared with two methods from earlier papers and three baselines. MOBIUS performs best.
25
EXPERIMENTS - compare learning algorithms. Results are not significantly different among classification methods. This shows that the result is not sensitive to the learning algorithm when sufficient information is available in the features.
26
EXPERIMENTS - feature importance analysis. Classification using only the top ten features achieves an accuracy of 92.72%. Among the ranked features, numbers rank higher on average than English alphabet letters, and non-English alphabet letters and special characters have higher odds ratios on average. Top ten important features.
27
EXPERIMENTS - diminishing returns for adding more usernames and more features The identification accuracy shows a monotonically increasing trend. Even for a single prior username, the identification is 90.72% accurate.
28
EXPERIMENTS - diminishing returns for adding more usernames and more features. A power function fits the accuracy curve for usernames. Improvement becomes marginal as the number of usernames increases and is negligible beyond 7 usernames. The same analysis is repeated for features, to see how adding features compares with adding prior usernames; a power function fits that curve as well.
29
EXPERIMENTS - diminishing returns for adding more usernames and more features Let f(n,k) denote the performance of our method for n usernames and k features. The δ function is a finite difference approximation for the derivative ratio with respect to n and k. When δ (n, k) > 1, adding usernames improves performance more and when δ (n, k) < 1, adding features is better. We observe that for small values of n and k, i.e., when fewer usernames and features are available, features help best, but for all other cases adding usernames is more beneficial.
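The δ comparison can be sketched as a ratio of forward differences. The exact formula is not given in the slide text, so the forward-difference form below is an assumption, and the function name is hypothetical. Any callable f(n, k) returning a performance score can be passed in.

```python
def delta(f, n, k):
    """Ratio of the gain from one more username to the gain from one more feature.

    Uses forward differences (an assumed form):
    (f(n + 1, k) - f(n, k)) / (f(n, k + 1) - f(n, k)).
    """
    gain_usernames = f(n + 1, k) - f(n, k)
    gain_features = f(n, k + 1) - f(n, k)
    if gain_features == 0:
        raise ZeroDivisionError("adding a feature does not change f at (n, k)")
    return gain_usernames / gain_features
```

With a toy linear score such as `lambda n, k: 2 * n + k` (illustrative only, not measured data), `delta` returns 2.0 everywhere, which under the slide's rule means adding a username helps more than adding a feature.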
30
CONCLUSIONS. The paper demonstrates a behavioral-modeling methodology (MOBIUS) for connecting individuals across social media, which employs the minimal information available on all social media sites: usernames. This principled, behavioral modeling approach performs better than earlier methods. Features can be selected based on particular application needs. Adding more features and usernames can further improve learning performance, but with diminishing returns.
31
Discussions. First, duplicate names are very common. For example, if two people named “John” have the usernames “John2000” and “John200”, they are likely to be identified, wrongly, as the same person. If the dataset contains many people with duplicate names, accuracy may suffer to some extent. The paper does not address this problem. Second, people who share the same real name may well choose similar usernames on social media. The experiments report about 90% accuracy given only a single prior username. We doubt this result once people with duplicate names are taken into account.
32
Discussions. Usernames are generally very short, so MOBIUS may not be able to capture complete behavioral patterns and features from them. Moreover, people normally do not use many different social media sites and therefore do not have many usernames. As a result, the number of usernames in the prior username set is limited, and the features obtained may be incomplete. In addition, the paper does not provide information about the prior username set, and forming this set is difficult. If mistakes are made in forming the prior username set, the results derived from MOBIUS will be incorrect as well.
33
Thank you! Q & A