A Survey-Based Seminar: Data Cleaning & Uncertain Data Management
Speaker: Shawn Yang
Supervisors: Dr. Reynold Cheng, Prof. David Cheung

Outline
Introduction
Traditional Data Quality and Cleaning
Uncertainty Management in Traditional Data Cleaning
Cleaning Uncertain Database
Conclusion

Example: Report of Bird Sightings

Observer  Bird-ID  Bird-Name    Prob
Mary      Bird-1   Finch        0.8
Mary      Bird-1   Toucan       0.2
Susan     Bird-1   Nightingale  0.7
Susan     Bird-1   Toucan       0.3
Another   Bird-1   Hummingbird  0.65
Another   Bird-1   Toucan       0.35

After cleaning:

Observer  Bird-ID  Bird-Name
Mary      Bird-1   Finch
Susan     Bird-1   Nightingale
Another   Bird-1   Hummingbird
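
The cleaning step in the bird-sighting example can be sketched as follows. This is a minimal illustration that assumes "cleaning" means keeping the most probable bird name per observer; the function names `clean_by_max_prob` and `discarded_names` are hypothetical and not from the slides.

```python
# Rows copied from the bird-sighting example: (observer, bird_id, bird_name, prob).
SIGHTINGS = [
    ("Mary", "Bird-1", "Finch", 0.8),
    ("Mary", "Bird-1", "Toucan", 0.2),
    ("Susan", "Bird-1", "Nightingale", 0.7),
    ("Susan", "Bird-1", "Toucan", 0.3),
    ("Another", "Bird-1", "Hummingbird", 0.65),
    ("Another", "Bird-1", "Toucan", 0.35),
]


def clean_by_max_prob(rows):
    """Keep, for each (observer, bird_id), only the most probable bird name.

    Assumption: this max-probability rule is one plausible reading of the
    cleaning step on the slide, not a rule stated there.
    """
    best = {}
    for observer, bird_id, name, prob in rows:
        key = (observer, bird_id)
        if key not in best or prob > best[key][1]:
            best[key] = (name, prob)
    return [(obs, bid, name) for (obs, bid), (name, _prob) in best.items()]


def discarded_names(rows):
    """Bird names that no longer appear anywhere after cleaning."""
    kept_names = {name for _obs, _bid, name in clean_by_max_prob(rows)}
    return sorted({name for _obs, _bid, name, _p in rows} - kept_names)


cleaned = clean_by_max_prob(SIGHTINGS)
lost = discarded_names(SIGHTINGS)
```

Under this rule the cleaned table matches the one on the slide, while "Toucan" disappears entirely even though every observer reported it as an alternative. That loss is the kind of information an uncertain database would keep.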

Philosophy
Data Cleaning
– To remove dirty data
Uncertain Data Management
– To preserve more information

Data Quality Issues
Dimensions: Multi-Source vs. Single Source; Schema Level vs. Instance Level
Single Source, Schema Level: Inconsistency
– Dirty data checked against a constraint: 010 Shanghai, 021 Beijing

Data Quality Issues
Single Source, Schema Level: Inconsistency
Single Source, Instance Level: Missing Values, Outliers
– Sensor network: temperature
– Census data: birth year

Data Quality Issues

                Single Source             Multi-Source
Schema Level    Inconsistency             Integration
Instance Level  Missing Values, Outliers  Duplication

Single Source & Schema Level
Inconsistent Repairs
– Example: 010 Shanghai, 021 Beijing
– Solutions: objective function, certain fix
To optimize some objective function
– Minimize the number of changes
– Cost function
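
Choosing a repair by an objective function can be sketched as below. The sketch assumes the constraint behind the example is "area code determines city" (010 is Beijing's code, 021 is Shanghai's); the rule table, the per-attribute weights, and all function names are illustrative assumptions rather than anything given on the slide.

```python
# Assumed constraint for illustration: the area code determines the city.
RULES = {"010": "Beijing", "021": "Shanghai"}


def is_consistent(t):
    """A tuple is consistent when its city matches the city of its area code."""
    return RULES.get(t["code"]) == t["city"]


def candidate_repairs():
    """Every consistent (code, city) tuple allowed by the assumed rule table."""
    return [{"code": code, "city": city} for code, city in RULES.items()]


def cost(original, repaired, weights):
    """Sum the weights of the attributes whose value was changed."""
    return sum(w for attr, w in weights.items() if original[attr] != repaired[attr])


def best_repairs(t, weights):
    """Return all candidate repairs of t that minimize the cost function."""
    if is_consistent(t):
        return [dict(t)]
    scored = [(cost(t, r, weights), r) for r in candidate_repairs()]
    lowest = min(c for c, _ in scored)
    return [r for c, r in scored if c == lowest]


dirty = {"code": "010", "city": "Shanghai"}

# Counting changes only: two repairs tie, so the result is ambiguous.
by_count = best_repairs(dirty, {"code": 1, "city": 1})

# Hypothetical cost function that trusts the area code more than the city.
by_cost = best_repairs(dirty, {"code": 2, "city": 1})
```

With unit weights both repairs change exactly one cell, which is why minimizing the number of changes alone may not single out a repair; a weighted cost function breaks the tie.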

Single Source & Schema Level
Inconsistent Repairs
– Example: 010 Shanghai, 021 Beijing
– Solutions: objective function, certain fix
Certain Fix (VLDB’10)
– Master data
– Certain region
– Some attribute values are asserted to be correct

Single Source & Schema Level
Cleaning Operations
– Deletion & insertion
– Update attribute values
Efficiency Issues
– NP-complete
– Heuristic methods

Others
Single Source, Instance Level
– Infer missing values; detect and correct outliers with machine learning / statistical methods
Multi-Source, Schema Level
– Schema mapping
Multi-Source, Instance Level
– Data deduplication (record linkage)

Single Source & Schema Level
Possible Repair
Cardinality-Set-Minimal Repair: A repair I’ of I is cardinality-set-minimal iff there is no repair I’’ of I such that Δ(I, I’’) ⊂ Δ(I, I’)
…

Single Source & Instance Level
Missing Values & Outliers
– Census database
ERACER (SIGMOD’10)
– The user inputs a dependency model (death age, parent age)
– Learn the parameters
– Infer the missing values
  o Infer the missing birth year from the death year & the death age distribution
  o Further infer the child’s birth year
– Repeat until the distributions converge
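
The two inference steps named on the ERACER slide can be illustrated with a single pass over discrete distributions. This is a heavily simplified sketch, not ERACER itself: it assumes the death-age and parent-age distributions are already learned, that they are independent, and the numbers and function names are made up for illustration.

```python
from collections import defaultdict


def infer_birth_year(death_year, death_age_dist):
    """Distribution of birth year = death year minus the death-age distribution."""
    return {death_year - age: p for age, p in death_age_dist.items()}


def infer_child_birth_year(parent_birth_dist, parent_age_dist):
    """Distribution of the child's birth year = parent birth year + parent age.

    Assumes the two distributions are independent, so their probabilities
    multiply and results for the same year are summed.
    """
    child = defaultdict(float)
    for birth, p_birth in parent_birth_dist.items():
        for age, p_age in parent_age_dist.items():
            child[birth + age] += p_birth * p_age
    return dict(child)


# Hypothetical learned parameters (not from the slides).
death_age_dist = {60: 0.25, 70: 0.5, 80: 0.25}
parent_age_dist = {20: 0.5, 30: 0.5}

parent_birth = infer_birth_year(1900, death_age_dist)
child_birth = infer_child_birth_year(parent_birth, parent_age_dist)
```

In the iterative scheme described on the slide, newly inferred distributions would feed back into the neighbouring records until the distributions stop changing; the sketch shows only one forward step.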

Multi-Source
Schema Level
– Uncertain schema matching
Instance Level
– Possible repairs in data deduplication (VLDB’09)

Cleaning Uncertain Database
Applying Integrity Constraints
– Exact method
– Sampling method
Quality of Uncertain Query Results
– PWS-quality
Efficiency Issues

Integrity Constraints
Difference from Traditional Database
– Traditional database: locate errors in the original database
– Uncertain database: locate errors in possible worlds
Difficulties
– Exponential number of possible worlds
Statistical Description
– Posterior probabilities, e.g. Prob[j=7|C]
Approaches
– Exact method
– Approximate method

Constraint set (C): SSN is unique

Uncertain relation:

Name  SSN  Prob
John
Bill

A possible world that violates C:

Name  SSN
John  7
Bill  7
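
A posterior such as Prob[j=7|C] can be computed naively by enumerating possible worlds, which also shows why the exponential number of worlds is a problem. The sketch below assumes John's SSN is 1 or 7 and Bill's is 4 or 7, each with probability 1/2; these probabilities are hypothetical, since the slide's table values are not available, and the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Hypothetical uncertain relation: name -> {candidate SSN: prior probability}.
RELATION = {
    "John": {1: Fraction(1, 2), 7: Fraction(1, 2)},
    "Bill": {4: Fraction(1, 2), 7: Fraction(1, 2)},
}


def possible_worlds(relation):
    """Yield (world, probability) for every combination of alternatives."""
    names = list(relation)
    for combo in product(*(relation[n].items() for n in names)):
        world = {n: value for n, (value, _p) in zip(names, combo)}
        prob = Fraction(1)
        for _value, p in combo:
            prob *= p
        yield world, prob


def ssn_is_unique(world):
    """Constraint C: no two people share an SSN."""
    return len(set(world.values())) == len(world)


def posterior(relation, constraint, event):
    """Prob[event | constraint], by summing over all possible worlds."""
    consistent = Fraction(0)
    matching = Fraction(0)
    for world, prob in possible_worlds(relation):
        if constraint(world):
            consistent += prob
            if event(world):
                matching += prob
    if consistent == 0:
        raise ValueError("no possible world satisfies the constraint")
    return matching / consistent


p_john_7 = posterior(RELATION, ssn_is_unique, lambda w: w["John"] == 7)
```

Here the world where both SSNs are 7 is discarded, so the probability that John's SSN is 7 drops from the assumed prior of 1/2 to a posterior of 1/3. The loop visits every world, so its cost grows exponentially with the number of uncertain tuples.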

Exact Method (Christoph VLDB’08)
– Model the constraints as assignments, e.g. j = 1; j = 7, b = 4; …
– Compress the assignments into a tree structure
– Calculate the posterior probabilities

Approximate Method (Haiquan Chen ICDE’10 Workshop)
– Aggregate constraints
– Model the constraints as scoring functions
– Get posterior probabilities by sampling

Employee  Salary (k)  Confidence
Alice
Bob
Charles

Constraints: Total salary in [50k, 70k]
…
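
Estimating a posterior by sampling under an aggregate constraint can be sketched with plain rejection sampling. This is a stand-in chosen for brevity and is not claimed to be the sampler used in the cited work; the salary alternatives, their confidences, and the 0/1 scoring function are all hypothetical.

```python
import random

# Hypothetical uncertain relation: employee -> {salary in k: confidence}.
RELATION = {
    "Alice": {20: 0.5, 30: 0.5},
    "Bob": {20: 0.5, 30: 0.5},
    "Charles": {10: 0.5, 30: 0.5},
}


def score(world):
    """Scoring function for the aggregate constraint: total salary in [50k, 70k].

    Assumption: a hard 0/1 score; a softer score in (0, 1) would also work
    with the acceptance test used below.
    """
    return 1.0 if 50 <= sum(world.values()) <= 70 else 0.0


def sample_world(relation, rng):
    """Draw one possible world by sampling each employee independently."""
    return {
        name: rng.choices(list(alts), weights=list(alts.values()))[0]
        for name, alts in relation.items()
    }


def estimate_posterior(relation, event, n_samples=20000, seed=0):
    """Estimate Prob[event | constraint] by rejection sampling."""
    rng = random.Random(seed)
    accepted = 0
    hits = 0
    for _ in range(n_samples):
        world = sample_world(relation, rng)
        # Accept the world with probability equal to its score.
        if rng.random() < score(world):
            accepted += 1
            hits += event(world)
    if accepted == 0:
        raise ValueError("no sampled world satisfied the constraint")
    return hits / accepted


p_charles_10 = estimate_posterior(RELATION, lambda w: w["Charles"] == 10)
```

For these made-up numbers, 5 of the 8 equally likely worlds satisfy the constraint and 4 of those have Charles at 10k, so the estimate should land near 0.8. Rejection sampling wastes samples when few worlds are consistent, which is one reason more elaborate samplers are used in practice.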

Quality of Uncertain Query Results (Reynold VLDB’08)
Different queries have different properties
– Range query: independent
– Min/max query: otherwise
Uniform metric for all uncertain queries
– Quality on possible world answers
Clean the uncertain tuples so as to improve query quality as much as possible
– Oracle assumption
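
A quality score defined on possible world answers can be sketched as follows. The slide does not give the formula, so this sketch assumes an entropy-style score (the sum of p log2 p over distinct answers, which is 0 when the answer is certain and more negative when it is ambiguous); the sensor readings and function names are hypothetical.

```python
import math
from collections import defaultdict


def answer_distribution(worlds, query):
    """Group possible worlds by query answer and sum their probabilities."""
    dist = defaultdict(float)
    for world, prob in worlds:
        dist[query(world)] += prob
    return dict(dist)


def quality(dist):
    """Assumed entropy-style score: sum of p * log2(p) over distinct answers."""
    return sum(p * math.log2(p) for p in dist.values() if p > 0)


def max_query(world):
    """A max query over the readings in one possible world."""
    return max(world.values())


# Hypothetical possible worlds before cleaning: sensor s1 is uncertain.
before = [
    ({"s1": 10, "s2": 20}, 0.5),
    ({"s1": 30, "s2": 20}, 0.5),
]

# After an oracle cleans s1 and confirms the value 30.
after = [({"s1": 30, "s2": 20}, 1.0)]

q_before = quality(answer_distribution(before, max_query))
q_after = quality(answer_distribution(after, max_query))
```

Because the score is computed on answers rather than on tuples, the same function applies to range, min, or max queries, which is the point of a uniform metric. Under the oracle assumption, cleaning s1 raises the score from -1 to 0 in this example.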

Efficiency Issue
A more “realistic” oracle
– Cleaning may fail
– Even a successful cleaning cannot remove all false values
– Cleaning may involve a cost
Objective
– Remove as much uncertainty as possible
– With a limited number of cleaning operations
Discussion
– Instance-level cleaning (clean a particular instance)
– Schema-level cleaning (clean the entire DB)

Conclusion
Improve Data Quality: 2 Directions
– Data Cleaning -> Remove errors
– Uncertain Data Management -> Maintain information
[Figure: Constraints; Traditional Database, Repairs; Uncertain Database, Consistent Possible World(s)]

Discussion
Thank You :)