Published by Basil Richard. Modified over 9 years ago.
1
Reza Zafarani and Huan Liu. Data Mining and Machine Learning Laboratory (DMML), Arizona State University. KDD 2013, Chicago, Illinois.
2
How hard can it be to identify an individual across sites? Privacy experts claim that advertisers know a lot about people. Can they stop showing you the same repetitive ads across sites?
3
More information about individuals. Many social media sites hold partial information; complementary information gives better user profiles.
Example profile fields on Facebook and Google+ (Age, Location, Education) for Huan Liu: N/A, USA, USC (1985-89).
Can we connect individuals across sites? Connectivity is not available. Consistency in information availability.
4
Can we verify that the information provided across sites belongs to the same individual?
5
MOBIUS: MOdeling Behavior for Identifying Users across Sites.
Human behavior generates information redundancy. Information shared across sites provides a behavioral fingerprint.
MOBIUS: behavioral modeling, minimum information.
6
Identification function. Minimum information available on ALL sites: usernames.
Candidate username: john.smith
Prior usernames: {jsmith, john.s}
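The identification function on this slide takes a candidate username and a set of prior usernames and decides whether they belong to the same individual. The sketch below only illustrates that input/output shape. The character-bigram Jaccard score and the 0.3 threshold are assumptions made for this example, not the learned function from the presentation.

```python
from typing import Iterable, Set


def bigrams(name: str) -> Set[str]:
    """Return the set of lowercase character bigrams in a username."""
    name = name.lower()
    return {name[i:i + 2] for i in range(len(name) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two usernames' bigram sets (illustrative score)."""
    set_a, set_b = bigrams(a), bigrams(b)
    if not set_a and not set_b:
        # Both names are too short to have bigrams: fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def identify(candidate: str, prior_usernames: Iterable[str],
             threshold: float = 0.3) -> bool:
    """Hypothetical identification function f(prior usernames, candidate).

    Returns True when the candidate looks like it belongs to the owner of
    the prior usernames. The threshold is an arbitrary illustrative value.
    """
    best = max((jaccard(candidate, u) for u in prior_usernames), default=0.0)
    return best >= threshold


if __name__ == "__main__":
    print(identify("john.smith", {"jsmith", "john.s"}))
```

In the presentation the decision comes from a learning framework over many features; a single hand-picked similarity such as this one would stand in for just one of those features.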
7
Behaviors (Behavior 1, Behavior 2, ..., Behavior n) generate information redundancy, which is captured via feature sets (Feature Set 1, Feature Set 2, ..., Feature Set n). A learning framework uses the data and these features to produce the identification function.
9
59% of individuals use the same username
10
Identifying individuals by their vocabulary size. Alphabet size is correlated to language: शमंत कुमार -> Shamanth Kumar
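As a rough illustration of the alphabet-size idea, the sketch below counts the distinct characters a person uses across their usernames and lists the scripts those characters come from. Both function names are hypothetical, and reading the script from the first word of the Unicode character name is a simplification chosen for this example.

```python
import unicodedata
from typing import Iterable, Set


def alphabet_size(usernames: Iterable[str]) -> int:
    """Number of distinct characters used across a set of usernames."""
    return len({ch for name in usernames for ch in name})


def scripts_used(usernames: Iterable[str]) -> Set[str]:
    """Approximate set of scripts, taken from Unicode character names.

    Only alphabetic characters are considered; the first word of the
    Unicode name (for example LATIN or DEVANAGARI) is used as the script.
    """
    found = set()
    for name in usernames:
        for ch in name:
            if ch.isalpha():
                uni_name = unicodedata.name(ch, "")
                if uni_name:
                    found.add(uni_name.split()[0])
    return found


if __name__ == "__main__":
    print(alphabet_size(["jsmith", "john.s"]))
    print(scripts_used(["शमंत", "shamanth"]))
```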
11
QWERTY keyboard; variants: AZERTY, QWERTZ. DVORAK keyboard. Keyboard type impacts your usernames.
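One way a keyboard layout could show up in usernames is through how often consecutive characters sit on the same key row. The sketch below computes that ratio for a QWERTY layout. The feature itself and the name `same_row_ratio` are assumptions for illustration; the slide does not say which keyboard features were used.

```python
# Letter rows of a standard QWERTY layout (digits and symbols are ignored).
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
ROW_OF = {ch: row_idx for row_idx, row in enumerate(QWERTY_ROWS) for ch in row}


def same_row_ratio(username: str) -> float:
    """Fraction of consecutive letter pairs typed on the same QWERTY row."""
    rows = [ROW_OF[ch] for ch in username.lower() if ch in ROW_OF]
    if len(rows) < 2:
        return 0.0
    same = sum(1 for a, b in zip(rows, rows[1:]) if a == b)
    return same / (len(rows) - 1)


if __name__ == "__main__":
    for name in ("qwerty", "qaz", "john.smith"):
        print(name, same_row_ratio(name))
```

Swapping `QWERTY_ROWS` for the rows of an AZERTY, QWERTZ, or DVORAK layout would give the corresponding feature for those keyboards.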
12
N-gram statistical language detector for 21 European languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene, and Swedish.
Usernames of individuals follow a language distribution.
European Parliament Parallel Corpus: 40m words per language.
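A character n-gram language detector can be sketched in a few lines: build n-gram counts per language, then score a username under each profile. The version below uses bigrams with add-one smoothing. The two one-sentence training samples are toy placeholders invented for this example (the slide's detector was built from a large parallel corpus), and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def char_ngrams(text: str, n: int = 2) -> List[str]:
    """All overlapping lowercase character n-grams of a string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train_profiles(samples: Dict[str, str], n: int = 2) -> Dict[str, Counter]:
    """Build one n-gram count profile per language from sample text."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def detect_language(username: str, profiles: Dict[str, Counter],
                    n: int = 2) -> Optional[str]:
    """Return the language whose profile gives the username the highest
    add-one-smoothed log-likelihood, or None if nothing can be scored."""
    grams = char_ngrams(username, n)
    if not grams or not profiles:
        return None
    vocab_size = len(set().union(*profiles.values()))
    best_lang, best_score = None, -math.inf
    for lang, counts in profiles.items():
        total = sum(counts.values())
        score = sum(math.log((counts[g] + 1) / (total + vocab_size)) for g in grams)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Toy samples only; a usable detector needs far more text per language.
TOY_SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "german": "der schnelle braune fuchs springt über den faulen hund und die katze",
}

if __name__ == "__main__":
    profiles = train_profiles(TOY_SAMPLES)
    print(detect_language("the", profiles))
    print(detect_language("und", profiles))
```

Running the detector over all of a person's usernames would give a distribution over languages, which is the kind of signal the slide describes.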
13
Kalambo. To avoid redundancy, we can use the username with maximum entropy.
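The maximum-entropy selection can be sketched with Shannon entropy over the characters of each username. Treating entropy as character-level Shannon entropy, and the helper name `max_entropy_username`, are assumptions made for this example.

```python
import math
from collections import Counter
from typing import Iterable, Optional


def shannon_entropy(username: str) -> float:
    """Shannon entropy (in bits) of the character distribution of a string."""
    if not username:
        return 0.0
    n = len(username)
    counts = Counter(username)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy + 0.0  # normalizes -0.0 to 0.0 for single-symbol strings


def max_entropy_username(usernames: Iterable[str]) -> Optional[str]:
    """Pick the username with the highest character entropy, if any."""
    return max(usernames, key=shannon_entropy, default=None)


if __name__ == "__main__":
    names = ["aaaa", "abab", "kalambo"]
    for name in names:
        print(name, round(shannon_entropy(name), 3))
    print(max_entropy_username(names))
```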
14
Adding prefixes/suffixes, abbreviating, swapping, or adding/removing characters. Nametag and Gateman. Usernames come from a language model.
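Modifications such as added prefixes or removed characters can be measured with standard string distances. The sketch below uses Levenshtein edit distance and the longest common substring as two plausible measures; whether the presentation used exactly these is not stated on the slide, so treat them as illustrative choices.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous run of characters shared by both strings."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


if __name__ == "__main__":
    print(edit_distance("jsmith", "john.smith"))
    print(longest_common_substring("jsmith", "john.smith"))
    # "nametag" reversed spells "gateman"; the two share no common bigram,
    # so substring-based measures alone would miss this relationship.
    print("nametag"[::-1] == "gateman")
```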
15
Data: 200,000 instances (50% class balance), 414 features.
Previous methods: 1) Zafarani and Liu, 2009; 2) Perito et al., 2011.
Baselines: 1) Exact username match; 2) Substring match; 3) Patterns in letters.
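The first two baselines are simple enough to sketch directly. The versions below compare a candidate username against the prior usernames; case-insensitive comparison and the function names are assumptions for this example, and the third baseline (patterns in letters) is omitted because the slide does not define it.

```python
from typing import Iterable


def exact_match(candidate: str, prior_usernames: Iterable[str]) -> bool:
    """Baseline 1: the candidate equals one of the prior usernames."""
    c = candidate.lower()
    return any(c == u.lower() for u in prior_usernames)


def substring_match(candidate: str, prior_usernames: Iterable[str]) -> bool:
    """Baseline 2: the candidate contains, or is contained in, a prior username."""
    c = candidate.lower()
    if not c:
        return False
    for u in prior_usernames:
        u = u.lower()
        if u and (c in u or u in c):
            return True
    return False


if __name__ == "__main__":
    priors = {"jsmith", "john.s"}
    print(exact_match("john.smith", priors))
    print(substring_match("john.smith", priors))
```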
19
A methodology for connecting individuals across sites: a behavioral modeling approach that uses minimum information across sites and allows for integration of additional behaviors when required.
Human behavior results in information redundancy. Information shared across sites acts as a behavioral fingerprint.
Discover applications of connecting users across sites. Incorporating features indigenous to specific sites.