Reza Zafarani and Huan Liu. Data Mining and Machine Learning Laboratory (DMML), Arizona State University. KDD 2013, Chicago, Illinois.
How hard can it be to identify an individual across sites? Privacy experts claim advertisers know a lot about people. Can they stop showing you the same repetitive ads across sites?
More information about individuals. Many social media sites: partial information on each site, complementary information across sites, better user profiles when combined. Example (Facebook, Google+): Age, Location, Education for Huan Liu: N/A, USA, USC ( ). Can we connect individuals across sites? Connectivity is not available. Consistency in information availability.
Can we verify that the information provided across sites belongs to the same individual?
MOBIUS: MOdeling Behavior for Identifying Users across Sites. Human behavior generates information redundancy. Information shared across sites provides a behavioral fingerprint. MOBIUS: behavioral modeling, minimum information.
Identification function. Minimum information available on ALL sites: usernames. Inputs: a candidate username (john.smith) and prior usernames ({jsmith, john.s}).
Behavior 1, Behavior 2, ..., Behavior n generate information redundancy, which is captured via Feature Set 1, Feature Set 2, ..., Feature Set n. The feature sets and the data feed a learning framework, which produces the identification function.
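The pipeline above (behaviors, feature sets, learning framework, identification function) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two feature extractors, the linear scoring rule, and the weights are hypothetical stand-ins for whatever features and learner are actually used.

```python
from typing import Callable, List

# Hypothetical type: each behavior is captured by a function that maps
# (candidate username, prior usernames) to a list of numeric features.
FeatureFn = Callable[[str, List[str]], List[float]]


def same_as_prior(candidate: str, priors: List[str]) -> List[float]:
    """Illustrative feature: 1.0 if the candidate equals a prior username."""
    return [1.0 if candidate in priors else 0.0]


def length_gap(candidate: str, priors: List[str]) -> List[float]:
    """Illustrative feature: |candidate length - mean prior length|."""
    mean_len = sum(len(p) for p in priors) / len(priors)
    return [abs(len(candidate) - mean_len)]


def build_features(candidate: str, priors: List[str],
                   extractors: List[FeatureFn]) -> List[float]:
    """Concatenate the feature sets produced by every behavior extractor."""
    row: List[float] = []
    for extract in extractors:
        row.extend(extract(candidate, priors))
    return row


def identify(candidate: str, priors: List[str], extractors: List[FeatureFn],
             weights: List[float], bias: float) -> bool:
    """Linear stand-in for a learned identification function f(U, c)."""
    x = build_features(candidate, priors, extractors)
    score = bias + sum(w * v for w, v in zip(weights, x))
    return score > 0


extractors = [same_as_prior, length_gap]
features = build_features("john.smith", ["jsmith", "john.s"], extractors)
decision = identify("john.smith", ["jsmith", "john.s"], extractors,
                    weights=[2.0, -0.1], bias=1.0)  # made-up weights
```

In practice the weights would come from training a classifier on labeled (prior usernames, candidate) pairs; new behaviors are added by appending another extractor to the list.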
59% of individuals use the same username
Identifying individuals by their vocabulary size. Alphabet size is correlated with language: शमंत कुमार -> Shamanth Kumar.
QWERTY keyboard; variants: AZERTY, QWERTZ. DVORAK keyboard. Keyboard type impacts your usernames.
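A keyboard-dependent feature can be sketched by mapping each letter to its row on a given layout. The example below assumes a standard QWERTY layout and uses one hypothetical feature (the share of consecutive letters typed on the same row); it is only an illustration of the idea, and the same table could be swapped for AZERTY, QWERTZ, or DVORAK.

```python
# Letter rows of a standard QWERTY layout (top, home, bottom).
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_ROW = {ch: row for row, keys in enumerate(QWERTY_ROWS) for ch in keys}


def same_row_fraction(username: str) -> float:
    """Fraction of consecutive letters typed on the same row.

    Characters that are not on the letter rows (digits, dots, ...) are
    dropped before pairing, so the pairs are over the remaining letters.
    """
    rows = [KEY_ROW[ch] for ch in username.lower() if ch in KEY_ROW]
    pairs = list(zip(rows, rows[1:]))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)


top_row_only = same_row_fraction("qwerty")
alternating = same_row_fraction("aq")
```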
N-gram statistical language detector for 21 European languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene, and Swedish. Usernames of individuals follow a language distribution. European Parliament Parallel Corpus: 40m words per language.
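To make the idea of a language distribution over a username concrete, here is a toy character-bigram detector with add-one smoothing and a uniform prior over languages. It is not the detector named on the slide, and the two tiny training strings are made up for illustration; a realistic model would be trained on a large corpus per language.

```python
import math
from collections import Counter
from typing import Dict, List


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """Character n-grams of the lowercased text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(corpora: Dict[str, str], n: int = 2) -> Dict[str, Counter]:
    """Count character n-grams for each language's sample text."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in corpora.items()}


def language_distribution(username: str, models: Dict[str, Counter],
                          n: int = 2) -> Dict[str, float]:
    """P(language | username) under smoothed n-gram models, uniform prior."""
    vocab_size = len(set().union(*models.values())) + 1  # +1 for unseen n-grams
    log_scores = {}
    for lang, counts in models.items():
        total = sum(counts.values())
        log_scores[lang] = sum(
            math.log((counts[g] + 1) / (total + vocab_size))
            for g in char_ngrams(username, n)
        )
    top = max(log_scores.values())  # subtract the max for numerical stability
    unnormalized = {lang: math.exp(s - top) for lang, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {lang: v / z for lang, v in unnormalized.items()}


# Made-up toy corpora, far too small for real use.
models = train({
    "english": "the quick brown fox jumps over the lazy dog and then the other thing",
    "german": "der schnelle braune fuchs springt über den faulen hund und dann noch ein anderes ding",
})
dist = language_distribution("thething", models)
```

The resulting probabilities, one per language, can be used directly as a feature vector for a username.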
Kalambo. To avoid redundancy, we can use the username with maximum entropy.
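One reading of this slide is that entropy is measured over the letter distribution of a username, and the prior username with the highest value is kept. The sketch below assumes Shannon entropy in bits over character frequencies; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def username_entropy(username: str) -> float:
    """Shannon entropy (bits) of the character distribution of a username."""
    counts = Counter(username)
    n = len(username)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def max_entropy_username(usernames: Iterable[str]) -> str:
    """Pick the username whose letters are most evenly spread."""
    return max(usernames, key=username_entropy)


chosen = max_entropy_username(["aaaa", "abab", "abcd"])
```

A username made of one repeated letter has zero entropy, while one with all distinct letters reaches the maximum for its length.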
Adding prefixes/suffixes, abbreviating, swapping or adding/removing characters. Example: Nametag and Gateman. Usernames come from a language model.
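The modifications listed above map naturally onto standard string measures. The pairing used here is an assumption made for illustration, not something the slide states: longest common substring for added prefixes or suffixes, longest common subsequence for abbreviations, edit distance for added, removed, or changed characters, and a letter-count comparison for reordered usernames such as Nametag and Gateman.

```python
from collections import Counter


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous run shared by a and b."""
    best, prev = 0, [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def longest_common_subsequence(a: str, b: str) -> int:
    """Length of the longest (not necessarily contiguous) shared subsequence."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        cur = [i]
        for j, ch_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ch_a != ch_b)))
        prev = cur
    return prev[-1]


def same_letters(a: str, b: str) -> bool:
    """True if both usernames use the same letters the same number of times."""
    return Counter(a.lower()) == Counter(b.lower())
```

For example, `jsmith` and `john.smith` share the substring `smith`, `jsmith` is a full subsequence of `john.smith`, and Nametag and Gateman contain exactly the same letters.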
Data: 200,000 instances (50% class balance), 414 features. Previous methods: 1) Zafarani and Liu, 2) Perito et al., 2011. Baselines: 1) exact username match, 2) substring match, 3) patterns in letters.
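The first two baselines are simple enough to sketch directly. The version below assumes case-insensitive comparison and treats a match in either direction as a substring match; both choices are assumptions, since the slide does not spell them out. The third baseline is not sketched because the slide gives no detail about it.

```python
from typing import List


def exact_match(candidate: str, priors: List[str]) -> bool:
    """Baseline 1: the candidate equals one of the prior usernames."""
    return candidate.lower() in {p.lower() for p in priors}


def substring_match(candidate: str, priors: List[str]) -> bool:
    """Baseline 2: the candidate contains, or is contained in, a prior username."""
    c = candidate.lower()
    # Skip empty priors, which would otherwise match every candidate.
    return any(p and (c in p.lower() or p.lower() in c) for p in priors)


priors = ["jsmith", "john.s"]
exact = exact_match("john.smith", priors)
partial = substring_match("john.smith", priors)
```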
Discover applications of connecting users across sites. Information shared across sites acts as a behavioral fingerprint. Human behavior results in information redundancy. Incorporating features indigenous to specific sites. A methodology for connecting individuals across sites: a behavioral modeling approach that uses minimum information across sites and allows for integration of additional behaviors when required.