REZA ZAFARANI AND HUAN LIU DATA MINING AND MACHINE LEARNING LABORATORY (DMML) ARIZONA STATE UNIVERSITY KDD 2013 – CHICAGO, ILLINOIS.


1 REZA ZAFARANI AND HUAN LIU DATA MINING AND MACHINE LEARNING LABORATORY (DMML) ARIZONA STATE UNIVERSITY KDD 2013 – CHICAGO, ILLINOIS

2 How hard can it be to identify an individual across sites? Privacy Experts Claim Advertisers Know a lot about People Can they stop showing you the same repetitive ads across sites?

3 More information about individuals Many social media sites Partial Information Complementary Information Better User Profiles Facebook Google+ Age Location Education Huan Liu N/A USA USC (1985-89) Can we connect individuals across sites? Connectivity is not available Consistency in Information Availability

4 Can we verify that the information provided across sites belongs to the same individual?

5 MOdeling Behavior for Identifying Users across Sites (MOBIUS) Human behavior generates information redundancy Information shared across sites provides a behavioral fingerprint MOBIUS - Behavioral Modeling - Minimum Information

6 Identification Function Minimum information available on ALL sites: Usernames Candidate Username (john.smith) Prior Usernames ({jsmith, john.s})

7 Behavior 1 Behavior 2 Behavior n Information Redundancy Feature Set 1 Feature Set 2 Feature Set n Generates Captured Via Learning Framework Data Identification Function
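The pipeline on slides 6 and 7 can be sketched as follows: each behavior is captured by a feature function over the candidate username and the prior usernames, the features are concatenated into one vector, and a learned classifier plays the role of the identification function. The three feature functions and the names `featurize` and `identify` are illustrative assumptions; they are not the 414 features used in the talk.

```python
from statistics import mean

# Hypothetical feature functions: each maps (candidate, priors) to a number.
def candidate_length(candidate, priors):
    return float(len(candidate))

def mean_prior_length(candidate, priors):
    return float(mean(len(p) for p in priors))

def exact_match(candidate, priors):
    return 1.0 if candidate in priors else 0.0

FEATURES = [candidate_length, mean_prior_length, exact_match]

def featurize(candidate, priors):
    """Concatenate all feature values into one vector."""
    return [f(candidate, priors) for f in FEATURES]

def identify(candidate, priors, classifier):
    """Identification function: True if the classifier judges that the
    candidate belongs to the owner of the prior usernames."""
    return bool(classifier(featurize(candidate, priors)))

vector = featurize("john.smith", ["jsmith", "john.s"])
```

Any classifier that accepts a feature vector can be plugged in as `classifier`; new behaviors are added by appending to `FEATURES`.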

9 59% of individuals use the same username

10 Identifying individuals by their vocabulary size Alphabet Size is correlated to language: शमंत कुमार -> Shamanth Kumar
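A minimal sketch of the alphabet-size idea, assuming it means the number of distinct characters an individual uses across usernames. The script check via Unicode character names is an added illustration of why alphabet correlates with language; both function names are hypothetical.

```python
import unicodedata
from collections import Counter

def alphabet_size(usernames):
    """Number of distinct characters used across a set of usernames."""
    return len({ch for name in usernames for ch in name})

def dominant_script(text):
    """Most frequent script among letters and marks, read from the first
    word of each character's Unicode name (e.g. 'LATIN', 'DEVANAGARI')."""
    scripts = Counter()
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "M"):
            scripts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else None

size = alphabet_size(["jsmith", "john.s"])
```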

11 QWERTY Keyboard Variants: AZERTY, QWERTZ DVORAK Keyboard Keyboard type impacts your usernames
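One way a keyboard layout could leave a trace in usernames is through typing patterns. The sketch below computes the share of consecutive letters typed with the same hand on a QWERTY layout. The hand assignment assumes standard touch typing, and `same_hand_ratio` is a hypothetical feature name, not one taken from the talk.

```python
# Assumed touch-typing hand assignment for a QWERTY layout.
LEFT_HAND = set("qwertasdfgzxcvb")
RIGHT_HAND = set("yuiophjklnm")

def hand(ch):
    """Return 'L' or 'R' for a letter, or None for any other character."""
    ch = ch.lower()
    if ch in LEFT_HAND:
        return "L"
    if ch in RIGHT_HAND:
        return "R"
    return None

def same_hand_ratio(username):
    """Fraction of consecutive letter pairs typed with the same hand."""
    hands = [h for h in map(hand, username) if h is not None]
    pairs = list(zip(hands, hands[1:]))
    if not pairs:
        return 0.0
    return sum(a == b for a, b in pairs) / len(pairs)
```

Swapping in a different key-to-hand mapping (for AZERTY, QWERTZ, or DVORAK) would yield a different value for the same username, which is the intuition behind the slide.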

12 N-gram statistical language detector for 21 European languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene, and Swedish Usernames of individuals follow a language distribution European Parliament Parallel Corpus - 40m words per language
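The general technique behind an n-gram language detector can be sketched with character bigrams and add-one smoothing. This is not the detector used in the talk: the two toy training sentences stand in for the parallel corpus, and all function names are assumptions.

```python
import math
from collections import Counter

def bigrams(text):
    """Overlapping character bigrams of the lowercased text."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]

def train(samples):
    """Build one bigram-count profile per language."""
    return {lang: Counter(bigrams(text)) for lang, text in samples.items()}

def score(text, counts, vocab_size):
    """Log-likelihood of the text under a profile, add-one smoothed."""
    total = sum(counts.values())
    return sum(math.log((counts[b] + 1) / (total + vocab_size))
               for b in bigrams(text))

def detect_language(text, profiles):
    """Return the language whose profile gives the highest score."""
    vocab = set().union(*profiles.values())
    return max(profiles, key=lambda lang: score(text, profiles[lang], len(vocab)))

# Toy training data (an assumption; far too small for real use).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then the other one",
    "german": "der schnelle braune fuchs springt über den faulen hund und dann der andere",
}
PROFILES = train(SAMPLES)
```

Applied to a username, `detect_language(username, PROFILES)` returns the best-scoring language label. With profiles this small the result is only indicative.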

13 Kalambo To avoid redundancy we can use the username with maximum entropy
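A minimal sketch of selecting the maximum-entropy username, assuming entropy here means Shannon entropy over the character distribution of a single username. The slide does not state the exact definition, so treat this as one plausible reading.

```python
import math
from collections import Counter

def entropy(username):
    """Shannon entropy (in bits) of the username's character distribution."""
    if not username:
        return 0.0
    counts = Counter(username)
    n = len(username)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def max_entropy_username(usernames):
    """Pick the username with the highest character entropy."""
    return max(usernames, key=entropy)
```

For example, `max_entropy_username(["aaaa", "abcd"])` returns `"abcd"`, because four distinct characters carry 2 bits per character while a single repeated character carries none.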

14 Adding Prefixes/Suffixes, Abbreviating, Swapping or Adding/Removing Characters Nametag and Gateman Usernames come from a language model
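The modification patterns listed on this slide can be approximated with simple string comparisons. The sketch below is an illustration under assumptions: an affix check for added prefixes or suffixes, a same-letters check for swapped characters, and Levenshtein edit distance for added or removed characters. None of the function names come from the talk.

```python
def is_affix_variant(a, b):
    """True if one username is the other plus a prefix or a suffix."""
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    return short != long_ and (long_.startswith(short) or long_.endswith(short))

def same_letters(a, b):
    """True if both usernames use the same letters in a different order."""
    return sorted(a.lower()) == sorted(b.lower())

def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]
```

The slide's own pair illustrates the swapping case: `same_letters("Nametag", "Gateman")` is true because both names are built from the same seven letters.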

15 Data: 200,000 instances (50% class balance) 414 Features Previous Methods: 1) Zafarani and Liu, 2009 2) Perito et al., 2011 Baselines: 1) Exact Username Match 2) Substring Match 3) Patterns in Letters
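The first two baselines are simple enough to sketch directly. The slide does not define them precisely, so the code assumes a case-insensitive comparison and treats a substring match as either username containing the other. The third baseline (patterns in letters) is not specified in enough detail to reproduce here.

```python
def exact_username_match(candidate, priors):
    """Baseline 1: candidate equals one of the prior usernames."""
    c = candidate.lower()
    return any(c == p.lower() for p in priors)

def substring_match(candidate, priors):
    """Baseline 2: candidate contains a prior username, or vice versa."""
    c = candidate.lower()
    return any(p.lower() in c or c in p.lower() for p in priors)

priors = ["jsmith", "john.s"]
exact = exact_username_match("john.smith", priors)
substring = substring_match("john.smith", priors)
```

On the example from slide 6, the exact-match baseline rejects `john.smith`, while the substring baseline accepts it because `john.s` is contained in the candidate.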

19 Discover applications of connecting users across sites Information shared across sites acts as a behavioral fingerprint Human Behavior Results in Information Redundancy Incorporating features indigenous to specific sites A methodology for connecting individuals across sites: a behavioral modeling approach; uses minimum information across sites; allows for integration of additional behaviors when required

