Large-Scale Cost-sensitive Online Social Network Profile Linkage
Background & Motivation Footprints in different social networks. User identification in social analysis. Privacy & security. Commercial & government applications.
Outline Problem definition Related work Approach Experiment Conclusion & future work
Problem Definition Terminology Identity: a person. Profile/User: your footprint on social media. Profile Linkage: linking your footprints together. Input & Output Input: profiles of one site as QUERY and profiles of the other site as TARGET. Output: all profile pairs classified as matches.
Characteristics of profile Name (structured vs. semi-structured): {“given name”: “haochen”, “family name”: “zhang”} vs. name: zhang haochen. Semi-structured schema. Incompleteness & missing attributes: privacy policy, virtual identification. Free text description: Bio, About me, Tags. Multilingualism.
Top 5 languages in the Facebook dataset: English, Portuguese, Spanish, Chinese, French. Most frequent tokens in different languages: chris, john, michael; chen, wang, lee; carlos, garcia, daniel; sergey, olga, alexander. About 70% of users use English. 7.2% of users register under different locales. Transliteration: 昊辰 => Haochen.
Feature Acquisition Network communication costs too much time. The Google Maps API limits the number of web service invocations per day. Computational complexity is high compared with string similarity: image processing algorithms.
Related work User linking across the social networks Record linkage and entity resolution Cost-sensitive feature acquisition
Overview of approach Classification of Potential Links: features representation, supervised learning, cost-sensitive feature acquisition. Pruning with Canopy: parameter tuning, canopy construction. Entity-based Representation of Profiles: mapping, tokenization, entity extraction.
Canopy: design
Canopy: efficiency
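The canopy slides carry no surviving detail, so the following is only a minimal sketch of how canopy-style pruning is commonly done: candidate pairs are restricted to QUERY and TARGET profiles that share at least one token, found through an inverted index. The function name `build_canopies`, the token-list input format, and the `max_token_freq` cutoff are all assumptions for illustration, not the authors' design.

```python
from collections import defaultdict


def build_canopies(query_profiles, target_profiles, max_token_freq=1000):
    """Return candidate (query_id, target_id) pairs that share a token.

    Both inputs map a profile id to a list of tokens (hypothetical format).
    Tokens that occur in more than max_token_freq target profiles are
    skipped, because very common tokens would produce huge canopies.
    """
    # Inverted index: token -> ids of target profiles containing it.
    index = defaultdict(set)
    for target_id, tokens in target_profiles.items():
        for token in set(tokens):
            index[token].add(target_id)

    candidates = set()
    for query_id, tokens in query_profiles.items():
        for token in set(tokens):
            targets = index.get(token, ())
            if len(targets) > max_token_freq:
                continue
            for target_id in targets:
                candidates.add((query_id, target_id))
    return candidates
```

Only the pairs returned here would be passed to the classifier, which avoids comparing every QUERY profile against every TARGET profile.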
Local Features Username: Jaro-Winkler similarity. Language: Jaccard similarity. Description, URL: cosine similarity with TF×IDF. Popularity: defined as the number of friends of a user. Adopt the following metric
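The three string-based local features named above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the Jaro-Winkler prefix scale of 0.1 and the 4-character prefix cap are the conventional defaults, the smoothed IDF formula is one common variant, and `doc_freq` / `n_docs` are hypothetical inputs holding corpus statistics. The popularity metric is not reproduced because its formula did not survive in the slides.

```python
import math
from collections import Counter


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order, k = 0, 0
    for i in range(len(s1)):
        if used1[i]:
            while not used2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tfidf_cosine(tokens_a, tokens_b, doc_freq, n_docs):
    """Cosine similarity of TF x IDF vectors (smoothed IDF, an assumption)."""
    def weights(tokens):
        return {
            tok: tf * (math.log((1 + n_docs) / (1 + doc_freq.get(tok, 0))) + 1)
            for tok, tf in Counter(tokens).items()
        }

    wa, wb = weights(tokens_a), weights(tokens_b)
    dot = sum(w * wb.get(tok, 0.0) for tok, w in wa.items())
    norm = math.sqrt(sum(w * w for w in wa.values())) * math.sqrt(
        sum(w * w for w in wb.values())
    )
    return dot / norm if norm else 0.0
```

For example, `jaro_winkler("MARTHA", "MARHTA")` evaluates to roughly 0.961, the textbook value for this pair.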
External Features Geographic Location: values are diverse, with different types. Google Maps API: string-represented location => geographic information. Spherical distance between two locations as the feature. Avatar: χ² dissimilarity of the avatars' gray-scale histograms.
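Once a location string has been geocoded to latitude and longitude, and once avatars have been reduced to gray-scale histograms, both external features are short computations. The sketch below assumes the haversine formula for spherical distance, a mean Earth radius of 6371 km, and the symmetric, normalized form of the χ² histogram distance; the slides do not state which variants were used, and the geocoding call itself is omitted.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def spherical_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def chi2_dissimilarity(hist_a, hist_b):
    """Symmetric chi-square distance between two non-empty histograms.

    Histograms are normalized to sum to 1 first, so the result lies in [0, 1].
    """
    sum_a, sum_b = sum(hist_a), sum(hist_b)
    total = 0.0
    for a, b in zip(hist_a, hist_b):
        pa, pb = a / sum_a, b / sum_b
        if pa + pb > 0:
            total += (pa - pb) ** 2 / (pa + pb)
    return 0.5 * total
```

Both functions return 0 for identical inputs, so smaller values indicate a more likely match.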
Classification: learning Probabilistic model derived from naïve Bayes. Independent feature assumption.
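Under the independent feature assumption, the naïve Bayes posterior odds factorize into one term per feature. The notation below is an assumption introduced for illustration: M and U denote the match and non-match classes, f_i is the i-th feature value, and S_n is read as the accumulated log-odds after n features, which the slides do not spell out.

```latex
\frac{P(M \mid f_1, \dots, f_n)}{P(U \mid f_1, \dots, f_n)}
  = \frac{P(M)}{P(U)} \prod_{i=1}^{n} \frac{P(f_i \mid M)}{P(f_i \mid U)},
\qquad
S_n = S_0 + \sum_{i=1}^{n} \log \frac{P(f_i \mid M)}{P(f_i \mid U)},
\quad
S_0 = \log \frac{P(M)}{P(U)}.
```

Written this way, each newly acquired feature only adds one term to the running score, which is what makes feature-by-feature inference possible.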
Classification: learning Iterative inference. Terminate if S_n is discriminative. Set the threshold for each feature by choosing an error rate on the training set; the threshold determines whether S_n is discriminative. Order of the features.
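The iterative inference loop can be sketched as follows. This is a simplified illustration, not the authors' implementation: it treats the score as a running log-odds value, uses one fixed pair of thresholds (`lower`, `upper`) rather than per-feature thresholds tuned from training error rates, and represents each feature as a hypothetical `(name, acquire, log_likelihood_ratio)` triple so that expensive features are only fetched when needed.

```python
def iterative_inference(initial_score, features, lower, upper):
    """Acquire features in order until the score is discriminative.

    features: ordered list of (name, acquire, log_likelihood_ratio) where
      acquire() fetches the (possibly costly) feature value and
      log_likelihood_ratio(value) returns its contribution to the score.
    Returns (is_match, final_score, names_of_acquired_features).
    """
    score = initial_score
    acquired = []
    for name, acquire, log_likelihood_ratio in features:
        # Stop early: the score is already outside the uncertain band.
        if score <= lower or score >= upper:
            break
        value = acquire()
        score += log_likelihood_ratio(value)
        acquired.append(name)
    return score > 0, score, acquired
```

Ordering `features` from cheap local features to costly external ones means a confident decision on usernames alone never triggers a geocoding request or an avatar download.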
Classification: learning Initial value: estimated from the prior that two profiles sharing rarer tokens are more likely to be matched.
Dataset of experiment Data source: 152,294 Twitter users; 154,379 LinkedIn users. Ground truth: 9,750 identities. 4,779 identities with both accounts. 3,339 identities with only a Twitter account. 1,632 identities with only a LinkedIn account.
Experiment: Performance on overall linkage I-Acc (Identity Accuracy): correctly identified identities / all identities in ground truth. Outperforms the naïve learning method because it adopts the prior. Performance varies across learning methods.
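The I-Acc ratio can be made concrete with a small sketch. The slide does not define when an identity counts as "correctly identified", so the rule used here is an assumption: an identity with both accounts is correct only if its two profiles are linked to each other and to nothing else, and an identity with a single account is correct only if that profile is left unlinked. The input formats are likewise hypothetical.

```python
def identity_accuracy(ground_truth, predicted_links):
    """Fraction of ground-truth identities that are correctly identified.

    ground_truth: list of (twitter_id or None, linkedin_id or None).
    predicted_links: set of (twitter_id, linkedin_id) pairs.
    """
    by_twitter, by_linkedin = {}, {}
    for tw, li in predicted_links:
        by_twitter.setdefault(tw, set()).add(li)
        by_linkedin.setdefault(li, set()).add(tw)

    correct = 0
    for tw, li in ground_truth:
        if tw is not None and li is not None:
            # Both accounts: exactly this one link must be predicted.
            ok = by_twitter.get(tw) == {li} and by_linkedin.get(li) == {tw}
        elif tw is not None:
            ok = tw not in by_twitter  # Twitter-only: must stay unlinked
        else:
            ok = li not in by_linkedin  # LinkedIn-only: must stay unlinked
        correct += ok
    return correct / len(ground_truth)
```

Unlike pairwise precision and recall, this measure also rewards leaving single-account identities unlinked.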
Experiment: Cost-sensitive feature acquisition Acquiring external features improves F1 by 5%. Different orders of external features: rank by cost, rank by distinguishability. Three sections divided by two inflection points.
Discussion: dataset construction Connections: cannot correctly reflect the web-scale setting; name is too significant. People search: difficult to construct the ground truth. Solution?
Discussion: people search task Query in LinkedIn by Twitter user’s name. Average of 10 results for each query. Results table (columns: Pre, Rec, F1; rows: Human, NB_Local, NB_All, C4.5_Local, C4.5_All, CSPL_Local, CSPL_All).
Discussion: feature dependency Features are compared independently. 2 people at Tsinghua with the same name Li Peng; 2 people at NUS with the same name Li Peng. Construct a different IDF table for names in each locale: not general, not significantly effective.
Conclusion We proposed a supervised probabilistic model to solve the identity linkage problem effectively. The prior that users sharing rarer tokens are more likely to be matched improves the performance of the approach. Iterative inference reduces unnecessary feature acquisitions.
Thank you