
CS295: Info Quality & Entity Resolution University of California, Irvine Fall 2010 Course introduction slides Instructor: Dmitri V. Kalashnikov Copyright © Dmitri V. Kalashnikov, 2010

2 Class Organizational Issues
Class Webpage
– Will put these slides there
Rescheduling of Class Time
– Now: Tue, Thu 3:30-4:50, ICS 209 (twice a week)
– New: Thu 3:00-5:20? (once a week, 10 min break in the middle)
– Easier on students
– Is this time slot OK?

3 Class Structure
Student Presentation-Based Class
– Students will present publications
– Papers cover recent trends (not comprehensive)
– Prepare slides
– Slides will be collected after presentation
– Discussion
Final grade
– Quality of slides
– Quality of presentations
– Participation & attendance
– We are a small class, so please attend!
– No exams!

4 Tentative Syllabus

5 Presentation
Topics
– All papers are split into “topics”
– A topic is covered on a given day
Student Presentations
– Each student will choose a day/topic to present
– A student presents on only one day in the quarter (if possible, to reduce workload)
– A student will present 2(?) papers on that day
– Please start preparing early!

6 Tentative List of Publications

7 How to present a paper
Present the “high-level” idea
– Main idea of the paper [should be clear]
Present the technical depth of the techniques
1) Cover techniques in detail
2) Try to analyze the paper [if you can]
– Discuss what you like about the paper
– Criticize the technique
– Do you see flaws/weaknesses in the proposed methodology?
– Do you think the techniques can be improved?
– Do you think the authors should have included additional info/algorithms?
Present experiments
– Explain datasets/setup/graphs (analyze results)
– Criticize experiments [if you can]
– Large enough data? More experiments needed? Unexplained trends? etc.

8 Who wants to present first?

9 Talk Overview
– Class Organizational Issues
– Intro to Data Quality & Entity Resolution

10 Data Processing Flow: Data → Analysis → Decisions
Data
– Organizations & people collect large amounts of data
– Many types of data: textual, semi-structured, multimodal
Analysis
– Data is analyzed for a variety of purposes
– Automated analysis: Data Mining
– Human in the loop: OLAP
– Ad hoc
– Analysis for decision making (business decision making, etc.)

11 Quality of decisions depends on quality of data
Quality of data is critical
– $1 billion market (estimated by Forrester Group)
Data Quality
– Very old research area
– But no comprehensive textbook exists yet!
Quality of Data → Quality of Analysis → Quality of Decisions

12 Example of Analysis on Bad Data: CiteSeer
CiteSeer: Top-k most cited authors
Unexpected entries
– Let’s check two people in DBLP: “A. Gupta” and “L. Zhang”
Analysis on bad data can lead to incorrect results. Fix errors before analysis.
More than 80% of researchers working on data mining projects spend more than 40% of their project time on cleaning and preparation of data.

13 *Why* Do Data Quality Issues Arise?
Types of DQ Problems
– Ambiguity
– Uncertainty
– Erroneous data values
– Missing values
– Duplication
– etc.

14 Example of Ambiguity
– Ambiguity
– Categorical data
– “Location: Washington”
– D.C.? State? Something else?

15 Example of Uncertainty
– Uncertainty
– Numeric data
– “John’s salary is between $50K and $80K”
– Query: find all people with salary > $70K
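To make the example concrete, here is a small illustrative sketch (not from the slides): if we assume, purely for illustration, that an uncertain salary is uniformly distributed over its interval, the query can return each person with a match probability. The function and the data below are hypothetical.

```python
# Hypothetical sketch: evaluating "salary > $70K" over uncertain numeric data,
# assuming each salary is uniformly distributed over its [low, high] interval.

def prob_salary_above(low, high, threshold):
    """Probability that a salary uniform on [low, high] exceeds threshold."""
    if threshold <= low:
        return 1.0
    if threshold >= high:
        return 0.0
    return (high - threshold) / (high - low)

people = {"John": (50_000, 80_000), "Mary": (90_000, 95_000)}
for name, (lo, hi) in people.items():
    p = prob_salary_above(lo, hi, 70_000)
    print(f"{name}: P(salary > $70K) = {p:.2f}")
# John matches with probability 1/3 under the uniform assumption; Mary is a certain match.
```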

16 Example of Erroneous Data
– Erroneous data values

17 Example of Missing Values
– Missing values

18 Example of Duplication
– Duplication
– Same? Different?

19 Inherent Problems vs. Errors in Preprocessing
Inherent Problems with Data
– The dataset contains errors (as in the previous slides)
Errors in Preprocessing
– The dataset might not contain errors
– But a preprocessing algorithm (e.g., extraction) fails
– Text: “John Smith lives in Irvine CA at 100 main st, his salary is $25K”
– Extractor: <Person: Irvine, Location: CA, Salary: $100K, Address: null>

20 *When* Do Data Quality Issues Arise? Past.
Manual entering of data
– Prime reason in the past [Winkler, etc.]
– People make mistakes while typing in info, e.g., Census data
Tools to prevent entering bad data in the first place
– E.g., field redundancy
– Trying hard to avoid the problem altogether
– Sometimes it is not possible, e.g., missing data in the census
Tools to detect problems with data
– If-then-else and other types of rules to fix problems
– Applied by people inconsistently, even though strict guidelines existed
– Asking people to fix problems is not always a good idea!

21 *When* Do DQ Issues Arise? Present.
Automated generation of DB content
– Prime reason for DQ issues nowadays
Analyzing unstructured or semi-structured raw data
– Text / Web
– Extraction
Merging DBs or data sources
– Duplicate information
– Inconsistent information
– Missing data
Inherent problems with well-structured data
– As in the examples shown

22 Data Flow wrt Data Quality
Raw Data → Handle Data Quality → Analysis → Decisions
Two general ways to deal with DQ problems
1) Resolve them, then apply analysis on clean data
– Classic Data Quality approach
2) Account for them in the analysis on dirty data
– E.g., put data into a probabilistic DBMS
– Often not considered as DQ

23 Resolve only what is needed!
Raw Data → Handle Data Quality → Analysis → Decisions
– Data might have many different (types of) problems in it
– Solve only those that might impact your analysis
– Example publication DB: all papers by John Smith; venues might have errors; the rest is accurate
– Task: count papers => do not fix venues!!!

24 Focus of this class: Entity Resolution (ER)
– ER is a very common Data Quality challenge
– Disambiguating uncertain references to objects
– Multiple variations:
  − Record Linkage [winkler:tr99]
  − Merge/Purge [hernandez:sigmod95]
  − De-duplication [ananthakrishna:vldb02, sarawagi:kdd02]
  − Hardening soft databases [cohen:kdd00]
  − Reference Matching [mccallum:kdd00]
  − Object identification [tejada:kdd02]
  − Identity uncertainty [pasula:nips02, mccallum:iiweb03]
  − Coreference resolution [ng:acl02]
  − Fuzzy match and fuzzy grouping
  − Name Disambiguation [han:jcdl04, li:aaai04]
  − Reference Disambiguation [km:siam05]
  − Object Consolidation [mccallum:kdd03wkshp, chen:iqis05]
  − Reference Reconciliation [dong:sigmod05]
– Ironically, some of these names are themselves duplicates!

25 Entity Resolution: Lookup and Grouping
Lookup
– List of all objects is given
– Match references to objects
Grouping
– No list of objects is given
– Group references that corefer

26 When Does the ER Challenge Arise?
Merging multiple data sources (even structured ones)
– “J. Smith” in DataBase1, “John Smith” in DataBase2: do they co-refer?
References to people/objects/organizations in raw data
– Who is the “J. Smith” mentioned as an author of a publication?
Location ambiguity
– “Washington” (D.C.? WA? Other?)
Automated extraction from text
– “He’s got his PhD/BS from UCSD and UCLA respectively.” PhD: UCSD or UCLA?
Natural Language Processing (NLP)
– “John met Jim and then he went to school.” “he”: John or Jim?

27 Standard Approach to Entity Resolution
– Choosing features to use: for comparing two references
– Choosing blocking functions: to avoid comparing all pairs
– Choosing a similarity function: outputs how similar two references are
– Choosing a problem representation: how to represent it internally, e.g., as a graph
– Choosing a clustering algorithm: how to group references
– Choosing a quality metric (in experimental work): how to measure the quality of the results
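A minimal sketch of how these choices fit together, as an aid to reading the following slides; every component name below is hypothetical and each stage is deliberately simplistic.

```python
# Hypothetical skeleton of the standard pipeline; each argument is a pluggable
# choice (blocking key, similarity function, decision threshold).

def resolve(references, block_key, similarity, threshold):
    blocks = {}
    for ref in references:                      # blocking: avoid all-pairs comparison
        blocks.setdefault(block_key(ref), []).append(ref)
    matches = []
    for block in blocks.values():               # similarity on candidate pairs only
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                if similarity(block[i], block[j]) > threshold:
                    matches.append((block[i], block[j]))
    return matches                              # a clustering step would then group these

names = ["J. Smith", "John Smith", "A. Gupta"]
print(resolve(names,
              block_key=lambda s: s.split()[-1][0].lower(),        # first letter of last token
              similarity=lambda a, b: float(a.split()[-1] == b.split()[-1]),
              threshold=0.5))
# [('J. Smith', 'John Smith')]
```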

28 Inherent Features: Standard Approach
Deciding whether two references u and v (e.g., “J. Smith” and “John Smith”) co-refer by analyzing their features
– “Similarity function” / “feature-based similarity”: s(u,v) = f(u,v)
– If s(u,v) > t, then u and v are declared to co-refer

29 Advanced Approach: Information Used
In addition to the inherent features of u and v (e.g., “J. Smith” vs. “John Smith”), ER can use:
– Context features
– Entity-relationship graph (social network)
– Web
– External data: public datasets (e.g., DBLP, IMDB), encyclopedias (e.g., Wikipedia), ontologies (e.g., DMOZ)
– Asking a person (not frequently; might not work well)
– (Conditional) functional dependencies & consistency constraints

30 Blocking Functions
Comparing each reference pair
– N >> 1 references in the dataset
– Each can co-refer with the remaining N-1
– Complexity N(N-1)/2 is too high…
Blocking functions
– A fast function that finds potential matches quickly
– [Figure: against the ground truth and the naïve all-pairs set for a reference R1, BF1 produces one extra candidate; BF2 loses one and produces one extra]

31 Blocking Functions (contd)
Multiple BFs could be used
– Better if independent
– Use different record fields for blocking
Examples (from [Winkler 1984])
1) Fst3(ZipCode) + Fst4(NAME)
2) Fst5(ZipCode) + Fst6(Street name)
3) 10-digit phone #
4) Fst3(ZipCode) + Fst4(LngstSubstring(NAME))
5) Fst10(NAME)
– BF4 is the #1 single BF
– BF1 + BF4 is the #1 pair
– BF1 + BF5 is the #2 pair
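An illustrative sketch of two such blocking keys; the record layout and field names are assumptions for the example, not part of the Winkler study.

```python
# Sketch of Winkler-style blocking keys (record field names are hypothetical).

def bf1(record):
    """First 3 chars of ZIP code + first 4 chars of the name field."""
    return record["zip"][:3] + record["name"][:4].upper()

def bf3(record):
    """10-digit phone number, digits only."""
    return "".join(ch for ch in record["phone"] if ch.isdigit())[:10]

r1 = {"name": "Smith, John", "zip": "92697-3435", "phone": "(949) 824-1234"}
r2 = {"name": "Smith, J.",   "zip": "92697",      "phone": "949-824-1234"}

# Two records are compared only if they share a key under at least one blocking function.
candidate_pair = (bf1(r1) == bf1(r2)) or (bf3(r1) == bf3(r2))
print(candidate_pair)  # True: both keys agree, so the pair survives blocking
```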

32 BFs: Other Interpretations
Dataset is split into smaller blocks
– Matching operations are performed on blocks
– A block is a clique
Blocking
– Applying somebody else’s technique first
– Not only will it find candidates
– But it will also merge many (even most) cases
– It will only leave “tough cases”
– Apply your technique on these “tough cases”

33 Basic Similarity Functions: s(u,v) = f(u,v)
– How to compare attribute values: lots of metrics, e.g., Edit Distance, Jaro, ad hoc; cross-attribute comparisons
– How to combine attribute similarities: many methods, e.g., supervised learning, ad hoc
– How to mix it with other types of evidence: not only inherent features
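A minimal sketch of a feature-based s(u,v), assuming hypothetical attributes and hand-picked weights; difflib's ratio stands in for a real string metric such as edit distance or Jaro, and in practice the weights would typically be learned.

```python
from difflib import SequenceMatcher

# Hypothetical attributes and weights; in practice the combination is often learned.
WEIGHTS = {"name": 0.6, "affiliation": 0.3, "email": 0.1}

def attr_sim(a, b):
    """A simple string similarity in [0, 1] (difflib's ratio as a stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def s(u, v):
    """s(u,v) = f(u,v): a weighted combination of per-attribute similarities."""
    return sum(w * attr_sim(u[attr], v[attr]) for attr, w in WEIGHTS.items())

u = {"name": "J. Smith",   "affiliation": "UC Irvine", "email": "jsmith@uci.edu"}
v = {"name": "John Smith", "affiliation": "UCI",       "email": "john.smith@uci.edu"}

t = 0.5  # decision threshold
print(round(s(u, v), 2), s(u, v) > t)  # declare co-reference if s(u,v) exceeds t
```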

34 Standardization & Parsing
Standardization
– Converting attribute values into the same format, for proper comparison
Examples
– Convert all dates into MM/DD/YYYY format, so that “Jun 1, 2010” matches “6/01/10”
– Convert time into HH:mm:ss format, so that 3:00PM and 15:00 match
– Convert Doctor -> Dr.; Professor -> Prof.
Parsing
– Subdividing a value into proper fields
– “Dr. John Smith, Jr.” becomes <Title: Dr., First: John, Last: Smith, Suffix: Jr.>
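A small sketch of date standardization along these lines; the set of accepted input formats below is illustrative only, not exhaustive.

```python
from datetime import datetime

# Sketch: normalize dates to MM/DD/YYYY before comparison.
DATE_FORMATS = ["%b %d, %Y", "%m/%d/%y", "%m/%d/%Y"]  # illustrative subset

def standardize_date(value):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    return value  # leave unparseable values untouched

TITLE_MAP = {"Doctor": "Dr.", "Professor": "Prof."}  # simple token substitutions

print(standardize_date("Jun 1, 2010"))  # 06/01/2010
print(standardize_date("6/01/10"))      # 06/01/2010 -> the two values now match
```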

35 Example of a Similarity Function
Edit Distance (1965)
– Comparing two strings
– The minimum number of edits (insertions, deletions, substitutions) needed to transform one string into another
– Ex.: “Smith” vs. “Smithx”: one edit is needed
– Dynamic programming solution
Example of an advanced version
– Assign different costs to insertion, deletion, substitution
– Some errors are more expensive (unlikely) than others
– The distance d(s1,s2) is the min-cost transformation
– Learn the costs from data (supervised learning)
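A straightforward dynamic-programming implementation of edit distance; the unit costs here could be replaced by learned, error-specific costs as in the advanced version.

```python
def edit_distance(s1, s2, ins=1, dele=1, sub=1):
    """Classic dynamic-programming edit distance with configurable edit costs."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,      # deletion
                          d[i][j - 1] + ins,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

print(edit_distance("Smith", "Smithx"))  # 1: a single edit suffices
```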

36 Clustering
Lots of methods exist
– Really a lot!
Basic methods
– Hierarchical
– Agglomerative: decide a threshold t; if s(u,v) > t then merge(u,v)
– Partitioning
Advanced issues
– How to decide the number of clusters K
– How to handle negative evidence & constraints
– Two-step clustering & cluster refinement
– Etc.; a very vast area
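A minimal sketch of the threshold-based merging idea (merge u and v whenever s(u,v) > t, then take connected components as clusters); real ER clustering is considerably more careful about cluster-level similarity and refinement. The toy similarity function is an assumption for the example.

```python
# Sketch: threshold-based agglomerative grouping via union-find.

def cluster(items, sim, t):
    parent = list(range(len(items)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if sim(items[i], items[j]) > t:
                parent[find(i)] = find(j)   # merge(u, v)

    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(items[i])
    return list(clusters.values())

names = ["J. Smith", "John Smith", "A. Gupta"]
sim = lambda a, b: len(set(a.lower().split()) & set(b.lower().split())) / 2
print(cluster(names, sim, 0.4))  # [['J. Smith', 'John Smith'], ['A. Gupta']]
```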

37 Quality Metrics
Purity of clusters
– Do clusters contain mixed elements? (~precision)
Completeness of clusters
– Do clusters contain all of their elements? (~recall)
Tradeoff between them
– A single metric that combines them (~F-measure)

38 Precision, Recall, and F-measure
Assume
– You perform an operation to find relevant (“+”) items, e.g., Google “UCI” or some other terms
– R is the ground truth set, i.e., the set of relevant entries
– A is the answer returned by some algorithm
Precision
– P = |A ∩ R| / |A|
– Which fraction of the answer A are correct (“+”) elements
Recall
– R = |A ∩ R| / |R|
– Which fraction of the ground truth elements were found (in A)
F-measure
– F = 2 / (1/P + 1/R), the harmonic mean of precision and recall
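A direct translation of these definitions into code; the item identifiers are placeholders, and the example values are the ones used on the next slide.

```python
def precision_recall_f(answer, relevant):
    a, r = set(answer), set(relevant)
    p = len(a & r) / len(a) if a else 0.0          # P = |A ∩ R| / |A|
    rec = len(a & r) / len(r) if r else 0.0        # R = |A ∩ R| / |R|
    f = 2 * p * rec / (p + rec) if p + rec else 0.0  # harmonic mean
    return p, rec, f

# A = {a1,a3,a5,a7,a9}, R = {a1,a2,a5,a6,a9,a10}
print(precision_recall_f({"a1", "a3", "a5", "a7", "a9"},
                         {"a1", "a2", "a5", "a6", "a9", "a10"}))
# (0.6, 0.5, 0.545...)
```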

39 Quality Metric: Pairwise F-measure
Example
– R = {a1, a2, a5, a6, a9, a10}
– A = {a1, a3, a5, a7, a9}
– A ∩ R = {a1, a5, a9}
– Pre = |A ∩ R| / |A| = 3/5
– Rec = |A ∩ R| / |R| = 1/2
Pairwise F-measure
– “+” are pairs that should be merged
– “-” are pairs that should not be merged
– Now, given an answer, one can compute Pre, Rec, and F-measure
– A widely used metric in ER
– But a bad choice in many circumstances!
– What is a good choice?
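A sketch of the pairwise variant for ER clusterings: the “+” items are the unordered pairs of references placed in the same cluster, and precision/recall are computed over those pairs. The toy clusterings are hypothetical.

```python
from itertools import combinations

def positive_pairs(clusters):
    """All unordered pairs of references placed in the same cluster ('+' pairs)."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}

def pairwise_prf(result_clusters, truth_clusters):
    a, r = positive_pairs(result_clusters), positive_pairs(truth_clusters)
    pre = len(a & r) / len(a) if a else 0.0
    rec = len(a & r) / len(r) if r else 0.0
    f = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return pre, rec, f

truth  = [["x1", "x2", "x3"], ["x4"]]
result = [["x1", "x2"], ["x3", "x4"]]
print(pairwise_prf(result, truth))  # (0.5, 0.333..., 0.4)
```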

40 Web People Search (WePS)
1. Query Google with a person name (e.g., “John Smith”)
2. Top-K webpages are returned (related to any John Smith)
3. Task: cluster the webpages, one cluster per person (Person 1, Person 2, …, Person N; N is unknown beforehand)
Web domain
– Very active research area
– Many problem variations, e.g., context keywords

41 Recall that…
Lookup
– List of all objects is given
– Match references to objects
Grouping
– No list of objects is given
– Group references that corefer
WePS is a grouping task

42 User Interface
– User input
– Results

43 System Architecture
Query (“John Smith”) → Search Engine → Top-K Webpages → Preprocessing → Clustering → Postprocessing → Results (Person 1, Person 2, Person 3, …)
– Preprocessing: TF/IDF, NE/URL extraction, ER graph (uses auxiliary information)
– Clustering: groups the preprocessed webpages
– Postprocessing: cluster sketches, cluster rank, webpage rank
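A toy sketch of the TF/IDF and clustering stages only, under simplifying assumptions; the actual system also uses NE/URL extraction, an ER graph, and auxiliary information, all of which are omitted here, and the sample pages are invented.

```python
import math
from collections import Counter

# Minimal sketch: TF/IDF vectors + cosine similarity over fetched webpages;
# a threshold on the similarity would then group pages by person.

def tfidf_vectors(pages):
    docs = [Counter(p.lower().split()) for p in pages]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    return [{t: tf * math.log(n / df[t]) for t, tf in d.items()} for d in docs]

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

pages = ["john smith uci professor databases",
         "john smith irvine databases entity resolution",
         "john smith jazz guitarist album"]
vecs = tfidf_vectors(pages)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
# Pages about the same person should score higher; thresholding then yields the clusters.
```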