REZA ZAFARANI AND HUAN LIU DATA MINING AND MACHINE LEARNING LABORATORY (DMML) ARIZONA STATE UNIVERSITY KDD 2013 – CHICAGO, ILLINOIS.

How hard can it be to identify an individual across sites? Privacy experts claim advertisers know a lot about people. Can they stop showing you the same repetitive ads across sites?

More information about individuals. Many social media sites: partial information, complementary information, better user profiles. Facebook / Google+ profile fields for Huan Liu: Age N/A, Location USA, Education USC (1985-89). Can we connect individuals across sites? Connectivity is not available. Consistency in information availability.

Can we verify that the information provided across sites belongs to the same individual?

MOBIUS: MOdeling Behavior for Identifying Users across Sites. Human behavior generates information redundancy. Information shared across sites provides a behavioral fingerprint. MOBIUS: behavioral modeling, minimum information.

Identification Function. Minimum information available on ALL sites: usernames. Candidate username (john.smith). Prior usernames ({jsmith, john.s}).
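As a rough illustration of the inputs named on this slide, the sketch below turns a candidate username and a set of prior usernames into a numeric feature vector that a classifier could consume. The names `length_gap`, `shared_prefix`, and `featurize` are hypothetical, and the two features are placeholders rather than the features used in the talk.

```python
from typing import Callable, List

# Hypothetical type: a feature maps (candidate, prior usernames) to a number.
Feature = Callable[[str, List[str]], float]


def length_gap(candidate: str, priors: List[str]) -> float:
    """Absolute gap between the candidate length and the mean prior length."""
    mean_len = sum(len(p) for p in priors) / len(priors)
    return abs(len(candidate) - mean_len)


def shared_prefix(candidate: str, priors: List[str]) -> float:
    """Length of the longest prefix the candidate shares with any prior."""
    best = 0
    for p in priors:
        n = 0
        for a, b in zip(candidate.lower(), p.lower()):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return float(best)


def featurize(candidate: str, priors: List[str],
              features: List[Feature]) -> List[float]:
    """Apply every feature to the (candidate, priors) pair."""
    return [f(candidate, priors) for f in features]


vector = featurize("john.smith", ["jsmith", "john.s"],
                   [length_gap, shared_prefix])
```

An identification function would then be a binary classifier trained on such vectors, labeled by whether the candidate and the prior usernames belong to the same person.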

Behavior 1, Behavior 2, ..., Behavior n -> (generates) Information Redundancy -> (captured via) Feature Set 1, Feature Set 2, ..., Feature Set n -> Data -> Learning Framework -> Identification Function

59% of individuals use the same username

Identifying individuals by their vocabulary size. Alphabet size is correlated to language: शमंत कुमार -> Shamanth Kumar
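One simple way to read "alphabet size" is the number of distinct characters a person uses across their usernames. The sketch below computes that, plus a coarse script label taken from Unicode character names. Both helper names are hypothetical, and this is only one plausible interpretation of the slide.

```python
import unicodedata
from typing import Iterable, Set


def alphabet_size(usernames: Iterable[str]) -> int:
    """Number of distinct characters used across a set of usernames."""
    return len(set("".join(usernames)))


def scripts_used(text: str) -> Set[str]:
    """Coarse script labels: first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


size = alphabet_size(["jsmith", "john.s"])
scripts = scripts_used("शमंत कुमार")
```

A username written in Devanagari and its Latin transliteration would get different script labels, which hints at the language of the person behind them.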

QWERTY keyboard (variants: AZERTY, QWERTZ). DVORAK keyboard. Keyboard type impacts your usernames.
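To make the keyboard idea concrete, here is a minimal sketch of one possible layout-based feature: the share of consecutive keystrokes that stay on the same QWERTY row. The feature and the name `same_row_ratio` are assumptions for illustration; the slide does not list the exact keyboard features used.

```python
# Rows of a standard QWERTY layout (letters and digits only).
QWERTY_ROWS = ["1234567890", "qwertyuiop", "asdfghjkl", "zxcvbnm"]
ROW_OF = {ch: r for r, row in enumerate(QWERTY_ROWS) for ch in row}


def same_row_ratio(username: str) -> float:
    """Fraction of consecutive key pairs typed on the same keyboard row."""
    rows = [ROW_OF[ch] for ch in username.lower() if ch in ROW_OF]
    if len(rows) < 2:
        return 0.0
    same = sum(1 for a, b in zip(rows, rows[1:]) if a == b)
    return same / (len(rows) - 1)


ratio = same_row_ratio("qwerty")
```

Swapping `QWERTY_ROWS` for an AZERTY, QWERTZ, or DVORAK table would give the same feature under a different layout assumption.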

N-gram statistical language detector for 21 European languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene, and Swedish. Usernames of individuals follow a language distribution. European Parliament Parallel Corpus: 40m words per language.

Kalambo. To avoid redundancy, we can use the username with maximum entropy.
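A minimal sketch of the entropy idea, assuming entropy means the Shannon entropy of a username's character distribution (the slide does not spell out the definition). The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def char_entropy(username: str) -> float:
    """Shannon entropy (in bits) of the character distribution."""
    counts = Counter(username)
    n = len(username)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def max_entropy_username(usernames: Iterable[str]) -> str:
    """Pick the username whose characters are least repetitive."""
    return max(usernames, key=char_entropy)


best = max_entropy_username(["aaaa", "abcd"])
```

A username made of one repeated character has entropy 0, while one with all distinct characters reaches the maximum for its length.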

Adding prefixes/suffixes, abbreviating, swapping or adding/removing characters. Nametag and Gateman. Usernames come from a language model.
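The modifications listed here (adding, removing, or swapping characters) can be measured with standard string distances. The sketch below uses Levenshtein edit distance and a letter-multiset comparison; these are stand-in measures chosen for illustration, not necessarily the ones used in the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def same_letters(a: str, b: str) -> bool:
    """True if both usernames use exactly the same letters, in any order."""
    return sorted(a.lower()) == sorted(b.lower())


d = edit_distance("jsmith", "jsmith1")
anagram = same_letters("Nametag", "Gateman")
```

"Nametag" and "Gateman" are far apart by edit distance yet use exactly the same letters, which is why an order-insensitive comparison can be a useful complement.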

Data: 200,000 instances (50% class balance), 414 features
Previous methods: 1) Zafarani and Liu, 2009; 2) Perito et al., 2011
Baselines: 1) Exact Username Match; 2) Substring Match; 3) Patterns in Letters
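The first two baselines are simple enough to sketch directly. The definitions below are assumptions about what "exact match" and "substring match" mean here (case-insensitive, checked against any prior username); the third baseline is not specified in enough detail to reconstruct.

```python
from typing import Iterable


def exact_match_baseline(candidate: str, priors: Iterable[str]) -> bool:
    """Predict 'same person' if the candidate equals any prior username."""
    c = candidate.lower()
    return any(c == p.lower() for p in priors)


def substring_match_baseline(candidate: str, priors: Iterable[str]) -> bool:
    """Predict 'same person' if the candidate contains, or is contained in,
    any prior username."""
    c = candidate.lower()
    return any(c in p.lower() or p.lower() in c for p in priors)


priors = ["jsmith", "john.s"]
exact = exact_match_baseline("john.smith", priors)
substring = substring_match_baseline("john.smith", priors)
```

On the example from the earlier slide, the exact-match baseline rejects john.smith while the substring baseline accepts it, because john.s is a prefix of the candidate.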

Discover applications of connecting users across sites. Information shared across sites acts as a behavioral fingerprint. Human behavior results in information redundancy. Incorporating features indigenous to specific sites. A methodology for connecting individuals across sites:
- A behavioral modeling approach
- Uses minimum information across sites
- Allows for integration of additional behaviors when required
