CS639: Data Management for Data Science

Slides:



Advertisements
Similar presentations
Learning Clusterwise Similarity with First-Order Features Aron Culotta and Andrew McCallum University of Massachusetts - Amherst NIPS Workshop on Theoretical.
Advertisements

Modeling and Querying Possible Repairs in Duplicate Detection George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David.
Large-Scale Entity-Based Online Social Network Profile Linkage.
1.Accuracy of Agree/Disagree relation classification. 2.Accuracy of user opinion prediction. 1.Task extraction performance on Bing web search log with.
NetSci07 May 24, 2007 Entity Resolution in Network Data Lise Getoor University of Maryland, College Park.
Data Quality Class 10. Agenda Review of Last week Cleansing Applications Guest Speaker.
GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.
Applications Chapter 9, Cimiano Ontology Learning Textbook Presented by Aaron Stewart.
Project 1 Assignment Building a mini-database for CCI in UNCC which includes entity sets: departments (CS,SIS, bioinformatics), faculties, courses given.
CSE 574 – Artificial Intelligence II Statistical Relational Learning Instructor: Pedro Domingos.
NITISH MANOCHA. Platforms §AIX workstation §OS/390 §Sun Solaris §Windows NT.
Disambiguation Algorithm for People Search on the Web Dmitri V. Kalashnikov, Sharad Mehrotra, Zhaoqi Chen, Rabia Nuray-Turan, Naveen Ashish For questions.
1 Convolution and Its Applications to Sequence Analysis Student: Bo-Hung Wu Advisor: Professor Herng-Yow Chen & R. C. T. Lee Department of Computer Science.
CBLOCK: An Automatic Blocking Mechanism for Large-Scale Deduplication Tasks Ashwin Machanavajjhala Duke University with Anish Das Sarma, Ankur Jain, Philip.
Large-Scale Cost-sensitive Online Social Network Profile Linkage.
A Privacy Preserving Efficient Protocol for Semantic Similarity Join Using Long String Attributes Bilal Hawashin, Farshad Fotouhi Traian Marius Truta Department.
Wang, Z., et al. Presented by: Kayla Henneman October 27, 2014 WHO IS HERE: LOCATION AWARE FACE RECOGNITION.
Disambiguation of References to Individuals Levon Lloyd (State University of New York) Varun Bhagwan, Daniel Gruhl (IBM Research Center) Varun Bhagwan,
Deduplication CSCI 572: Information Retrieval and Search Engines Summer 2010.
Signatures As Threats to Privacy Brian Neil Levine Assistant Professor Dept. of Computer Science UMass Amherst.
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.
Tag-based Social Interest Discovery
URI Disambiguation in the Context of Linked Data Afraz Jaffri, Hugh Glaser, Ian MillardECS, University of Southampton
Record Linkage Everything Data CompSci Spring 2014.
Entity Resolution for Big Data Lise Getoor University of Maryland College Park, MD Ashwin Machanavajjhala Duke University Durham, NC
INFORMATION EXTRACTION SNITA SARAWAGI. Management of Information Extraction System Performance Optimization Handling Change Integration of Extracted Information.
Introduction to Web Mining Spring What is data mining? Data mining is extraction of useful patterns from data sources, e.g., databases, texts, web,
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Some key resources for Knowledge Services. Scottish Health Libraries Catalogue Shelcat  Search the library catalogues.
Similar Document Retrieval and Analysis in Information Retrieval System based on correlation method for full text indexing.
INTERACTIVE ANALYSIS OF COMPUTER CRIMES PRESENTED FOR CS-689 ON 10/12/2000 BY NAGAKALYANA ESKALA.
Chapter 5 Software Maran Illustrated Computers CIS 102.
80 million tiny images: a large dataset for non-parametric object and scene recognition CS 4763 Multimedia Systems Spring 2008.
Web Directory For Computer Science Projects Nidhi Goel Course: CS 491B Instructor: Prof. Chengyu Sun December 8, 2006.
Collective Vision: Using Extremely Large Photograph Collections Mark Lenz CameraNet Seminar University of Wisconsin – Madison February 2, 2010 Acknowledgments:
May 30, 2016Department of Computer Sciences, UT Austin1 Using Bloom Filters to Refine Web Search Results Navendu Jain Mike Dahlin University of Texas at.
CONCEPTS AND TECHNIQUES FOR RECORD LINKAGE, ENTITY RESOLUTION, AND DUPLICATE DETECTION BY PETER CHRISTEN PRESENTED BY JOSEPH PARK Data Matching.
University at BuffaloThe State University of New York Lei Shi Department of Computer Science and Engineering State University of New York at Buffalo Frequent.
Detecting Communities Via Simultaneous Clustering of Graphs and Folksonomies Akshay Java Anupam Joshi Tim Finin University of Maryland, Baltimore County.
Date : 2013/03/18 Author : Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, Grant Weddell Source : CIKM’12 Speaker : Er-Gang Liu Advisor : Prof. Jia-Ling.
Poon Yan Horn Jonathan 09/04/2010 School Of Computing National University of Singapore Student Submissions Integrity Diagnosis (SSID) A User-Centric Plagiarism.
3.4 Internet Strand 3 Sara Liquori. 3.4 Internet  A global computer network providing a variety of information and communication facilities, consisting.
Scalable Hybrid Keyword Search on Distributed Database Jungkee Kim Florida State University Community Grids Laboratory, Indiana University Workshop on.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Improving the performance of personal name disambiguation.
Information Integration 15 th Meeting Course Name: Business Intelligence Year: 2009.
Computer Vision Group Department of Computer Science University of Illinois at Urbana-Champaign.
CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Link mining ( based on slides.
Record Linkage and Disclosure Limitation William W. Cohen, CALD Steve Fienberg, Statistics, CALD & C3S Pradeep Ravikumar, CALD.
Digital Archive page 1 Worzyk Anhalt University of Applied Sciences Digital Archive Storage of pictorial material from the Departments of Design and Architecture.
CPT-S Advanced Databases 1 Yinghui Wu EME 49 ADB (ln24)
Distance functions and IE - 3 William W. Cohen CALD.
Best-of-Breed Hybrid Methods for Text De-identification Yang H, Garibaldi JM. Automatic detection of protected health information from clinical narratives.
Introduction to Data Science Lecture 5 Data Integration
Optimizing Parallel Algorithms for All Pairs Similarity Search
Improving searches through community clustering of information
Hierarchical Clustering and Network Topology Identification
Personalized Social Image Recommendation
Call us on (Toll-free). How to Manage the Norton Identity Safe Logins Those Login information that you store in Identity Safe includes.
Lecture 9: Entity Resolution
ISI Web of Knowledge Early updates
Smart Portal To Protect Child Online
Record Linkage with Uniqueness Constraints and Erroneous Values
Selected Topics: External Sorting, Join Algorithms, …
RefWorks Rapid Review for Student Nurses
Disambiguation Algorithm for People Search on the Web
Summarization for entity annotation Contextual summary
CS639: Data Management for Data Science
Statistical Relational AI
1.3 FINDING & LIKING A PAGE 1. On Facebook search bar, type your [Format] – [City, St] and select the page from the drop down menu 2. Type
Information Retrieval and Web Design
Presentation transcript:

CS639: Data Management for Data Science Lecture 22: Entity Resolution [slides from Getoor and Machanavajjhala] Theodoros Rekatsinas

What is Entity Resolution? Problem of identifying and linking/grouping different manifestations of the same real world object. Examples of manifestations and objects: Different ways of addressing (names, email addresses, FaceBook accounts) the same person in text. Web pages with differing descriptions of the same business. Different photos of the same object. … Todo: make these more exciting/precise

Ironically, Entity Resolution has many duplicate names Record linkage Duplicate detection Coreference resolution Reference reconciliation Fuzzy match Object consolidation Object identification Deduplication Entity clustering Approximate match Identity uncertainty Merge/purge Household matching Hardening soft databases Householding Reference matching Doubles

ER Motivating Examples Linking Census Records Public Health Web search Comparison shopping Counter-terrorism Knowledge Graph Construction … Web search – query disambiguation

Motivation: ER and Network Analysis before after

Motivation: ER and Network Analysis Measuring the topology of the internet … using traceroute

IP Aliasing Problem [Willinger et al. 2009]

IP Aliasing Problem [Willinger et al. 2009]

IP Aliasing Problem [Willinger et al. 2009]

Normalization

Matching Features

Examples of matching features

Jaro

Levenshtein

Computing Levenshtein

Set similarity

Cosine similarity and TF/IDF

TF/IDF

Tokening and shingling

Pairwise-ER

Fellegi and Sunter

Supervised ML for pairwise ER

Active learning

Constraints under deduplication

Clustering-based ER

Possible clustering approaches

Correlation clustering

Summary