Large-Scale Cost-sensitive Online Social Network Profile Linkage.

Presentation transcript:

Large-Scale Cost-sensitive Online Social Network Profile Linkage

Background & Motivation
  Footprints in different social networks
  User identification in social analysis
  Privacy & security
  Commercial & government applications

Outline
  Problem definition
  Related work
  Approach
  Experiment
  Conclusion & future work

Problem Definition
  Terminology
    Identity: a person
    Profile/User: your footprint on social media
    Profile Linkage: linking your footprints together
  Input & Output
    Input: profiles of one site as QUERY and profiles of the other site as TARGET
    Output: all pairs of profiles classified as matched

Characteristics of profile
  Name (semi-structured vs. structured)
    {"given name": "haochen", "family name": "zhang"}
    name: zhang haochen
  Semi-structured schema
  Incompleteness & missing attributes
  Privacy policy
  Virtual identification
  Free text description: Bio, About me, Tags
  Multilingualism

Top 5 languages in the Facebook dataset
  English, Portuguese, Spanish, Chinese, French
  Most frequent tokens in different languages
    chris, john, michael
    chen, wang, lee
    carlos, garcia, daniel
    sergey, olga, alexander
  About 70% of users are in English
  7.2% of users register under different locales
  Transliteration: 昊辰 => Haochen

Feature Acquisition
  Network communication costs too much time.
    Web service invocations are limited per day (Google Maps API).
  Computational complexity is high compared to string similarity.
    Image processing algorithms

Related work
  User linking across social networks
  Record linkage and entity resolution
  Cost-sensitive feature acquisition

Overview of approach
  Classification of Potential Links: features representation, supervised learning, cost-sensitive feature acquisition
  Pruning with Canopy: parameter tuning, canopy construction
  Entity-based Representation of Profiles: mapping, tokenization, entity extraction

Canopy: design

Canopy: efficiency
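The canopy slides lost their figures in this transcript, so the following is only a sketch of the classic canopy technique, not the design actually presented. It assumes profiles are represented as token sets, uses Jaccard similarity as the cheap measure, and the names `build_canopies`, `loose` and `tight` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_canopies(token_sets, loose=0.2, tight=0.6):
    """Group profile ids into overlapping canopies.

    token_sets maps a profile id to its set of tokens (an assumed representation).
    A profile joins the current canopy if its similarity to the center is >= loose,
    and stops being a candidate center/member elsewhere if it is >= tight.
    """
    if tight < loose:
        raise ValueError("tight threshold must be >= loose threshold")
    remaining = list(token_sets)      # dict order keeps the result deterministic
    canopies = []
    while remaining:
        center = remaining[0]
        canopy = [center]
        keep = []
        for pid in remaining[1:]:
            sim = jaccard(token_sets[center], token_sets[pid])
            if sim >= loose:
                canopy.append(pid)    # close enough to be compared in detail
            if sim < tight:
                keep.append(pid)      # still eligible for other canopies
        canopies.append(canopy)
        remaining = keep
    return canopies
```

Only query/target pairs that fall into the same canopy would then be passed to the more expensive classifier, which is where the pruning comes from.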

Local Features
  Username: Jaro-Winkler similarity
  Language: Jaccard similarity
  Description, URL: cosine similarity with TF×IDF
  Popularity: defined as the number of friends of a user, compared with a dedicated metric (not preserved in this transcript)
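Two of the local similarity measures named above can be sketched with the standard library alone. This is an illustration under assumptions: the tokenization, the smoothed IDF formula and the function names are not taken from the slides, and Jaro-Winkler is left out for brevity.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def idf_table(docs):
    """Smoothed inverse document frequency per token over a list of token lists."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def tfidf_cosine(doc_a, doc_b, idf):
    """Cosine similarity between the TF x IDF vectors of two token lists."""
    vec_a = {t: c * idf.get(t, 0.0) for t, c in Counter(doc_a).items()}
    vec_b = {t: c * idf.get(t, 0.0) for t, c in Counter(doc_b).items()}
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For example, `jaccard(["en", "zh"], ["en"])` compares the language sets of two profiles, and `tfidf_cosine` would be applied to tokenized descriptions with an IDF table built over all profiles.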

External Features
  Geographic Location
    Values are diverse, with different types.
    Google Maps API: string-represented location => geographic information
    Spherical distance between two locations as the feature
  Avatar
    χ² dissimilarity of the avatars' gray-scale histograms
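Once locations have been geocoded and avatars reduced to gray-scale histograms, the two external features above could be computed roughly as follows. This is a sketch under assumptions: the haversine formula stands in for "spherical distance", the χ² variant with a 0.5 factor on normalized histograms is one common convention, and the geocoding and image loading steps are not shown.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def spherical_distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def chi2_dissimilarity(hist_a, hist_b):
    """Chi-square dissimilarity of two histograms after normalizing each to sum to 1."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    sum_a, sum_b = sum(hist_a), sum(hist_b)
    if sum_a == 0 or sum_b == 0:
        raise ValueError("histograms must not be empty")
    total = 0.0
    for a, b in zip(hist_a, hist_b):
        pa, pb = a / sum_a, b / sum_b
        if pa + pb > 0:
            total += (pa - pb) ** 2 / (pa + pb)
    return 0.5 * total
```

With this convention the dissimilarity is 0 for identical histograms and 1 for histograms with no overlapping bins.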

Classification: learning
  Probabilistic model derived from Naïve Bayes
  Independent feature assumption
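Under the independence assumption, a generic Naïve Bayes formulation of the match decision can be written as below. This is the textbook form, not necessarily the exact model on the slide; the symbols M (match), U (non-match), f_i (value of the i-th feature) and the reading of S_n as a running log-odds score are assumptions.

```latex
\[
\frac{P(M \mid f_1,\dots,f_n)}{P(U \mid f_1,\dots,f_n)}
  = \frac{P(M)}{P(U)} \prod_{i=1}^{n} \frac{P(f_i \mid M)}{P(f_i \mid U)},
\qquad
S_n = S_0 + \sum_{i=1}^{n} \log \frac{P(f_i \mid M)}{P(f_i \mid U)},
\quad
S_0 = \log \frac{P(M)}{P(U)}.
\]
```

Written this way, each newly acquired feature adds one term to the score, which is what makes feature-by-feature inference possible.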

Classification: learning
  Iterative inference
    Terminate if S_n is discriminative.
    Set a threshold for each feature, chosen from its error rate on the training set, to determine whether S_n is discriminative.
    Order of the features
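The iterative inference described above can be illustrated with a short loop that acquires features one at a time and stops as soon as the running score is decisive. This is a sketch under assumptions: S_n is treated as a log-odds score, each feature has its own lower and upper thresholds, and the names `FeatureStep` and `iterative_inference` are hypothetical.

```python
from typing import Callable, List, NamedTuple, Tuple


class FeatureStep(NamedTuple):
    name: str
    acquire: Callable[[], float]                     # possibly expensive acquisition
    log_likelihood_ratio: Callable[[float], float]   # contribution of the value to the score
    lower: float                                     # at or below: confidently a non-match
    upper: float                                     # at or above: confidently a match


def iterative_inference(initial_score: float,
                        steps: List[FeatureStep]) -> Tuple[bool, float, List[str]]:
    """Add features in the given order and stop once the score is discriminative."""
    score = initial_score
    used = []
    for step in steps:
        value = step.acquire()            # the cost is paid only when this line is reached
        score += step.log_likelihood_ratio(value)
        used.append(step.name)
        if score >= step.upper or score <= step.lower:
            break                         # later (costlier) features are never acquired
    return score > 0.0, score, used
```

Ordering the steps from cheap local features to costly external ones means the expensive calls are made only for pairs that remain ambiguous.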

Classification: learning
  Initial value
    Estimated from the prior that two profiles sharing rarer tokens are more likely to be matched.
    The resulting estimate is used as the initial value.
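The formula for the initial value did not survive in this transcript. The sketch below only illustrates the stated intuition, that sharing rarer tokens should give a pair a higher starting score, by summing the IDF of the shared tokens. The IDF definition, the `bias` parameter and both function names are assumptions, not the authors' estimator.

```python
import math
from collections import Counter


def token_idf(profiles_tokens):
    """IDF of each token over a list of per-profile token lists."""
    n = len(profiles_tokens)
    df = Counter(tok for tokens in profiles_tokens for tok in set(tokens))
    return {tok: math.log(n / count) for tok, count in df.items()}


def initial_score(tokens_a, tokens_b, idf, bias=0.0):
    """Starting score of a profile pair: bias plus the summed IDF of shared tokens."""
    shared = set(tokens_a) & set(tokens_b)
    return bias + sum(idf.get(tok, 0.0) for tok in shared)
```

A pair sharing only a very common token such as "john" starts with a lower score than a pair sharing a rare one such as "haochen".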

Dataset of experiment
  Data source
    152,294 Twitter users
    154,379 LinkedIn users
  Ground truth: 9,750 identities
    4,779 identities with both accounts
    3,339 identities with only a Twitter account
    1,632 identities with only a LinkedIn account

Experiment: Performance on overall linkage
  I-Acc (Identity Accuracy): correctly identified identities / all identities in ground truth
  Better than the naïve learning method, owing to the adopted prior.
  Performance differs across learning methods.
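The I-Acc ratio above can be made concrete with a small function. The slide does not say when an identity counts as "correctly identified", so this sketch assumes one plausible reading: an identity with both accounts is correct only if its two profiles are linked to each other and to nothing else, and an identity with a single account is correct only if that profile is left unlinked. The data layout and the name `identity_accuracy` are hypothetical.

```python
def identity_accuracy(ground_truth, predicted_links):
    """Fraction of ground-truth identities whose links are predicted exactly.

    ground_truth: list of (twitter_id or None, linkedin_id or None), one per identity.
    predicted_links: iterable of (twitter_id, linkedin_id) pairs classified as matched.
    """
    if not ground_truth:
        raise ValueError("ground truth must contain at least one identity")
    by_twitter, by_linkedin = {}, {}
    for tw, li in predicted_links:
        by_twitter.setdefault(tw, set()).add(li)
        by_linkedin.setdefault(li, set()).add(tw)
    correct = 0
    for tw, li in ground_truth:
        tw_links = by_twitter.get(tw, set()) if tw is not None else set()
        li_links = by_linkedin.get(li, set()) if li is not None else set()
        if tw is not None and li is not None:
            ok = tw_links == {li} and li_links == {tw}
        else:
            ok = not tw_links and not li_links
        correct += ok
    return correct / len(ground_truth)
```

Under this reading, a wrong link penalizes both identities it touches, which is stricter than pairwise precision and recall.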

Experiment: Cost-sensitive feature acquisition
  5% improvement in F1 from acquiring external features.
  Different orders of external features
    Rank by cost
    Rank by distinguishability
  Three sections divided by two inflection points.

Discussion: dataset construction
  Dataset construction
    Connections
      Cannot correctly reflect the web-scale scenario.
      Name is too significant.
    People search
      Difficult to construct the ground truth.
  Solution?

Discussion: people search task
  Query LinkedIn by the Twitter user's name
  On average, 10 results for each query
  Results table (values not preserved in this transcript)
    Columns: Pre, Rec, F1
    Rows: Human, NB_Local, NB_All, C4.5_Local, C4.5_All, CSPL_Local, CSPL_All

Discussion: feature dependency
  Features are compared independently.
    2 people at Tsinghua with the same name, Li Peng
    2 people at NUS with the same name, Li Peng
  Construct a different IDF table for names in each locale.
    Not general
    Not significantly effective

Conclusion
  We proposed a supervised probabilistic model to solve the identity linkage problem effectively.
  The prior that users sharing rarer tokens are more likely to be matched improves the performance of the approach.
  Iterative inference is able to reduce unnecessary feature acquisitions.

Thank you