How we all fit together ©2014 Inome, Inc. All Rights Reserved. Probabilistic Estimates of Attribute Statistics and Match Likelihood for People Entity Resolution.

Presentation transcript:

Probabilistic Estimates of Attribute Statistics and Match Likelihood for People Entity Resolution
Xin Wang, Ang Sun, Hakan Kardes, Siddharth Agrawal, Lin Chen, Andrew Borthwick

Our Mission
- Gather 20 billion raw records about people:
  - Publicly available
  - White pages (phone records, credit card headers)
  - Property records
  - Court records (criminal, civil, marriage/divorce)
  - Social media
  - News
  - Professional
- Conflate all the records about the same person
- Create a graph of 250 million profiles: one profile for everybody in the US

Our Approach
- Formulate the problem as a graph partition task
  - 7 billion nodes (each record is a node)
  - Edge weights are similarity scores from machine-learning-based models
  - Cluster the graph into million clusters

The Challenges
- Most graph partition algorithms can't scale to this size
  - Dynamic Blocking: iteratively divide the graph into smaller subgraphs
- Limited resources: an 88-node Hadoop cluster for multiple monthly builds
- The number of clusters in a subgraph is unknown
- People records are ambiguous by nature
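The "iteratively divide the graph into smaller subgraphs" idea can be illustrated with a small sketch. This is not the authors' implementation: the function name `dynamic_blocks`, the `key_funcs` list, and the `max_block_size` threshold are all hypothetical, and the sketch assigns each record to exactly one block, which a production blocking system need not do.

```python
from collections import defaultdict


def dynamic_blocks(records, key_funcs, max_block_size):
    """Split records into blocks, re-splitting any oversized block with the next key.

    key_funcs is an ordered list of functions mapping a record to a blocking key
    (hypothetical choice: coarse keys first, finer keys later).
    """
    done = []
    # Each work item: (key tuple so far, records in the block, index of next key to try)
    work = [((), list(records), 0)]
    while work:
        key, block, depth = work.pop()
        # Accept the block if it is small enough or no further keys remain.
        if len(block) <= max_block_size or depth == len(key_funcs):
            done.append((key, block))
            continue
        sub_blocks = defaultdict(list)
        for rec in block:
            sub_blocks[key_funcs[depth](rec)].append(rec)
        for sub_key, sub_block in sub_blocks.items():
            work.append((key + (sub_key,), sub_block, depth + 1))
    return done


records = [
    {"last": "Johnson", "city": "New York"},
    {"last": "Johnson", "city": "New York"},
    {"last": "Johnson", "city": "Index"},
    {"last": "Smith", "city": "Index"},
]
key_funcs = [lambda r: r["last"], lambda r: r["city"]]
blocks = dynamic_blocks(records, key_funcs, max_block_size=2)
```

Only the oversized "Johnson" block is split a second time, by city; the small "Smith" block is left alone. Pairwise similarity scoring would then run inside each block rather than across the whole graph.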

Records:
- Patricia Johnson: th St, New York, NY
- Patricia Johnson: Worked: Morgan Stanley, NY

Two records with a common name in a big city have a low probability of being about the same person.

Records:
- Patricia Johnson: th St, New York, NY
- Patricia Johnson: Worked: Morgan Stanley, NY
- Patricia Johnson: Index Elementary School, Index, WA
- Patricia Johnson: 402 5th St, Index, WA

Two records with a common name in a small town are more likely to be about the same person.

Records:
- Patricia Johnson: th St, New York, NY
- Patricia Johnson: Worked: Morgan Stanley, NY
- Patricia Johnson: 402 5th St, Index, WA
- Patricia Johnson: Index Elementary School, Index, WA
- Patricia Johnson: th St, New York, NY; SE 5th St, Bellevue, WA; 312 Main St, Oberlin, OH; DOB: 1974; Worked: Inome, Inc
- Patricia Johnson: Worked: Morgan Stanley, NY; BA, Oberlin College, 96

Combining evidence from multiple locations increases the match likelihood.

Records:
- Patricia Johnson: th St, New York, NY
- Patricia Johnson: Worked: Morgan Stanley, NY
- Patricia Johnson: 402 5th St, Index, WA
- Patricia Johnson: Index Elementary School, Index, WA
- Patricia Johnson: th St, New York, NY; SE 5th St, Bellevue, WA; 312 Main St, Oberlin, OH; DOB: 1974; Worked: Inome, Inc
- Patricia Johnson: Worked: Morgan Stanley, NY; BA, Oberlin College, 96
- Patricia Johnson: th St, New York, NY
- Patricia Johnson: Worked: Morgan Stanley, NY; DOB: 05/21/1974

Incorporating other demographic information also helps with matching two records.

Approximate Match Likelihood of Two Records with Demographics

Demographic information we can use:
- Name frequency
- Population of the US
- Population of a shared location
  - Can be a city, zip code, county, MSA, state, or distance based

Records:
- Patricia Johnson: th St, New York, NY
- Patricia Johnson: Worked: Morgan Stanley, NY
- Patricia Johnson: Index Elementary School, Index, WA
- Patricia Johnson: 402 5th St, Index, WA

Approximate Match Likelihood of Two Records with Demographics

Demographic information we can use:
- Name frequency
- Population of the US
- Population of a shared location
  - Can be a city, zip code, county, MSA, state, or distance based
- Birthday/age information

Records:
- Patricia Johnson: th St, New York, NY
- Patricia Johnson: Worked: Morgan Stanley, NY; DOB: 05/21/1974
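The transcript does not preserve the formula behind this slide, so the following is only a minimal sketch of how these statistics could be combined, under stated assumptions: name holders are spread across locations in proportion to population, and the chance that two same-name records in a shared location describe one person is roughly 1 / (1 + expected number of other people with that name there). The function names, the `dob_fraction` parameter, and all numbers are hypothetical.

```python
def expected_others(name_count_us, us_population, location_population, dob_fraction=1.0):
    """Expected number of other people in the location who share the name.

    dob_fraction (hypothetical parameter) is the share of the population that
    would also match the birthday/age information; 1.0 means no such information.
    """
    name_frequency = name_count_us / us_population
    return name_frequency * location_population * dob_fraction


def match_likelihood(name_count_us, us_population, location_population, dob_fraction=1.0):
    """Approximate P(same person) as 1 / (1 + expected number of other candidates)."""
    others = expected_others(name_count_us, us_population, location_population, dob_fraction)
    return 1.0 / (1.0 + others)


US_POP = 320_000_000   # hypothetical round figure
NAME_COUNT = 3_200     # hypothetical number of people with this name in the US

big_city = match_likelihood(NAME_COUNT, US_POP, 8_000_000)
small_town = match_likelihood(NAME_COUNT, US_POP, 200)
# Assume, very crudely, that a matching birth year narrows candidates to 1 in 80.
big_city_birth_year = match_likelihood(NAME_COUNT, US_POP, 8_000_000, dob_fraction=1 / 80)
```

With these made-up inputs the big-city likelihood is low, the small-town likelihood is close to 1, and adding a matching birth year raises the big-city figure, which mirrors the qualitative behavior the slides describe.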

Approximate Match Likelihood of Two Records with Demographics

Record 1: Patricia Johnson
- th St, New York, NY
- SE 5th St, Bellevue, WA
- 312 Main St, Oberlin, OH
- DOB: 1974
- Worked: Microsoft, Redmond, WA

Record 2: Patricia Johnson
- Worked: Morgan Stanley, NY
- BA, Oberlin College, 96
- 2 Beechwood Way, Scarborough, NY
- Worked: IBM, Armonk, NY

Approximate Match Likelihood of Two Records with Demographics

Multiple regions and multiple location matches in each region:
- Name frequency of a region
- Population of a region
  - State, MSA
- Population of a shared location
  - Can be a city, zip code, county, MSA, state, or distance based
- Birthday/age information
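How the per-region statistics are combined is not shown in the transcript. As a rough illustration only, one could assume that a person's presence in different locations is independent, so that each additional shared location shrinks the pool of other same-name candidates. The function names and every number below are hypothetical.

```python
def expected_in_location(regional_name_frequency, location_population):
    """Expected number of people with the name in one location, using the
    name frequency of the enclosing region (for example a state or MSA)."""
    return regional_name_frequency * location_population


def multi_location_likelihood(name_count_us, per_location_expected):
    """Approximate P(same person) when two records share several locations.

    Independence assumption (not from the source): the chance that some other
    name holder is tied to location i is expected_i / name_count_us, and these
    chances multiply across locations.
    """
    others = float(name_count_us)
    for expected in per_location_expected:
        others *= min(1.0, expected / name_count_us)
    return 1.0 / (1.0 + others)


NAME_COUNT = 3_200  # hypothetical number of people with this name in the US
nyc = expected_in_location(1e-5, 8_000_000)      # hypothetical regional frequency
bellevue = expected_in_location(1e-5, 120_000)   # hypothetical regional frequency

one_location = multi_location_likelihood(NAME_COUNT, [nyc])
two_locations = multi_location_likelihood(NAME_COUNT, [nyc, bellevue])
```

Under this assumption, sharing both New York and Bellevue leaves far fewer expected alternative candidates than sharing New York alone, so the approximate likelihood rises sharply.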

Approximate Match Likelihood of Two Records with Demographics

How do we get the demographic statistics?
1. Population
   - US population
   - State, MSA
   - County, city, zip code
2. Name frequencies
   - US, state, MSA
   - Different combinations of name components
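Counting "different combinations of name components" within a single data source might look like the sketch below. The record layout (`first`, `middle`, `last` keys) and the three chosen combinations are assumptions made for illustration; the slides do not say which combinations are used.

```python
from collections import Counter


def name_frequency_tables(records):
    """Count occurrences of several name-component combinations in one source."""
    tables = {
        "last": Counter(),
        "first_last": Counter(),
        "first_middle_last": Counter(),
    }
    for rec in records:
        first = rec["first"].lower()
        middle = (rec.get("middle") or "").lower()
        last = rec["last"].lower()
        tables["last"][last] += 1
        tables["first_last"][(first, last)] += 1
        if middle:  # only count the three-part form when a middle name is present
            tables["first_middle_last"][(first, middle, last)] += 1
    return tables


sample = [
    {"first": "Patricia", "last": "Johnson"},
    {"first": "Patricia", "middle": "A", "last": "Johnson"},
    {"first": "Timothy", "last": "Johnson"},
]
tables = name_frequency_tables(sample)
```

The same counting could be repeated per state or MSA to obtain regional tables.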

Data Sources and Their Record Counts

Gaussian Truth Model for Estimating Name Frequencies

Diagram labels: Data Source, Name Count, Name Observations, Source Variance, Source Priors, True Count, Name Priors

Implementation of GTM for Name Frequency Truth Estimation

Pipeline stages, in slide order:
- Source 1, Source 2, Source 3, ..., Source N
- Extract Source Name Frequency
- Name Freq Table 1, Name Freq Table 2, Name Freq Table 3, ..., Name Freq Table N
- Normalize Name Frequency
- Normalized Table 1, Normalized Table 2, Normalized Table 3, ..., Normalized Table N
- EM Algorithm to Extract Source Bias and Compute the True Name Frequency
- Normalized Estimates
- Denormalize Name Frequency
- Evaluators
- True Estimates
- Best Sources Config
- Name Freq Mean and Standard Error Table
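The EM stage can be sketched in a simplified form. The version below is an assumption-laden illustration, not the authors' code: it works on already normalized values, models each source only by a variance (it does not estimate a separate source bias), and its M-step uses the point estimate of the truth rather than the full posterior. The names `gaussian_truth_em`, `prior_var`, and `prior_weight` are hypothetical.

```python
def gaussian_truth_em(observations, n_iter=20, prior_var=1.0, prior_weight=1.0):
    """EM-style estimate of true normalized name frequencies and source variances.

    observations: dict mapping source -> dict mapping name -> normalized frequency.
    Returns (truth, stderr, var): per-name mean, per-name standard error,
    and per-source variance.
    """
    sources = list(observations)
    names = sorted({n for s in sources for n in observations[s]})
    var = {s: prior_var for s in sources}
    truth, stderr = {}, {}
    for _ in range(n_iter):
        # E-step: precision-weighted mean over the sources that observed each name.
        for n in names:
            num = den = 0.0
            for s in sources:
                if n in observations[s]:
                    w = 1.0 / var[s]
                    num += w * observations[s][n]
                    den += w
            truth[n] = num / den
            stderr[n] = den ** -0.5
        # M-step: source variance from squared residuals, smoothed by the prior
        # so that it can never collapse to zero.
        for s in sources:
            sq = sum((x - truth[n]) ** 2 for n, x in observations[s].items())
            var[s] = (prior_weight * prior_var + sq) / (prior_weight + len(observations[s]))
    return truth, stderr, var


# Hypothetical normalized observations: sources A and B agree, source C is noisy.
obs = {
    "A": {"x": 1.0, "y": 2.0, "z": 3.0},
    "B": {"x": 1.1, "y": 2.1, "z": 2.9},
    "C": {"x": 3.0, "y": 0.0, "z": 5.0},
}
truth, stderr, var = gaussian_truth_em(obs)
```

In this toy input the noisy source ends up with the largest variance, so the estimates lean toward the two sources that agree. The per-name standard error corresponds loosely to the "Name Freq Mean and Standard Error Table" output named on the slide.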

Experimental Results:
- Contribution of the Demographic Based Likelihood Feature
- Name Frequency Estimates (First Last)

Q & A

William H Gates II William H Gates William H Gates III Bill Gates 123 Main St Seattle, WA 235 NE 14 St Seattle, WA Bill and Melinda Gates Foundation William H Gates 621 Main St Bellevue WA JD, UW Harvard

Name Address History Phone Age/DOB SSN Raw Public Record Work History Education History Text Extracts Websites Raw Social Name Features Location Features Phone Features DOB Features SSN Features Demographic Features Relative Based Features ‘Household’ based features Company-wide features Graph Based Features Fields Domain/URL Fields Education Fields Work History Fields Text Based Features Whole Field Match Other Fields N-Gram Features Gender-based Neighborhood features Multi Feature Combination (Sum/Max) Combo Features Likelihood Combo Name Birthday Population (NBP) Score Regional NBP Score LinkedIn NBP Score RelatedByPhone Likelihood RelatedByAddr Likelihood Regional Population CombinedNameFreq NotSameNameAndNotSimilar Propositional Logic Combo AND, OR, NOT Propositional Logic Combo AND, OR, NOT if_[!title_whole_field_weighted]_then_[title_correlation] ExactFLAndDifferentMiddleFemale Multi-Field N-Gram Features EmployerTagHybrid SchoolTagHybrid BlurbTagHybrid BlurbAnchorHybrid JobtitleEmployerMultiFieldHybrid Company Location Name Age Inferred Information Keywords Histograms US/Global Geo Dist/Pop Dict US/Global Name Freq Dicts Education Institute Dicts Business loc/Employee Dicts /alias Freq Dicts Data Dictionaries (Mined/Purchased) N-Gram Dictionaries Phone Freq Dicts Address Frequency Dicts Ethnicity Biological Information Case Information Criminal Records Biometrics Biometrics Features Offense Base Features

Ave NE, Bellevue, WA Patricia Johnson Timothy Johnson Emily Johnson 227 E 56th St, New York, NY Stuart Johnson th St, New York, NY NE 5th St, Bellevue, WA 1502 SE 5th St Bellevue, WA Main St, Oberlin, OH BA, Oberlin College

Thank You