Lecture 13: Error Detection

Today’s Agenda
1. Data Errors and Detection
2. Qualitative Error Detection
3. Combining Error Detectors

1. Data Errors and Detection

What is a Data Error?

Error Detection Strategies
Rule-based detection algorithms: constraint violations, FDs, CFDs, denial constraints
Pattern verification and enforcement: syntactic patterns (date formatting), semantic patterns (location names, e.g. WI)
Quantitative methods: statistical outliers
Deduplication
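
Two of the strategies above, syntactic pattern verification and statistical outlier detection, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the date format (YYYY-MM-DD), the choice of the median absolute deviation (MAD) as the outlier statistic, the 3.5 cutoff, and all function names and sample values are hypothetical and do not come from the lecture.

```python
import re
from statistics import median

# Assumed syntactic pattern: dates written as YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def pattern_violations(values, pattern=DATE_PATTERN):
    """Return the indices of values that do not fully match the pattern."""
    return [i for i, v in enumerate(values) if not pattern.fullmatch(v)]


def mad_outliers(values, threshold=3.5):
    """Return the indices of numeric outliers by modified z-score.

    The score is 0.6745 * |x - median| / MAD; the threshold is an
    assumed, commonly used default rather than a value from the slides.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, nothing to flag
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical sample columns.
dates = ["2019-03-01", "03/02/2019", "2019-03-03"]
salaries = [50, 52, 51, 49, 500]

bad_dates = pattern_violations(dates)
bad_salaries = mad_outliers(salaries)
```

A pattern check flags values that are malformed regardless of the rest of the column, while the outlier check depends on the column's distribution, so the two tend to catch different errors.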

Variety of tools

2. Qualitative Error Detection

Error Detection Taxonomy

FDs and CFDs
Functional dependency (FD)
Conditional functional dependency (CFD): a functional dependency on a subset of the data
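
As a rough illustration, the sketch below checks a dependency of the form lhs -> rhs under the standard reading that any two tuples agreeing on the left-hand attributes must also agree on the right-hand ones. The optional `condition` predicate restricts the check to a subset of the rows, which is a simplified stand-in for a CFD. The function name, the row layout (a list of dicts), and the sample data are assumptions made for this example.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs, condition=None):
    """Return groups of row indices that violate the dependency lhs -> rhs.

    rows: list of dicts; lhs, rhs: lists of attribute names.
    condition: optional predicate on a row. When given, only matching rows
    are checked, i.e. the dependency is enforced on a subset of the data.
    """
    # lhs value -> rhs value -> indices of rows carrying that combination
    groups = defaultdict(lambda: defaultdict(list))
    for i, row in enumerate(rows):
        if condition is not None and not condition(row):
            continue
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        groups[key][val].append(i)

    violations = []
    for by_rhs in groups.values():
        if len(by_rhs) > 1:  # same lhs value, more than one rhs value
            violations.append(sorted(i for idxs in by_rhs.values() for i in idxs))
    return violations


# Hypothetical sample data.
rows = [
    {"country": "US", "zip": "53703", "city": "Madison"},
    {"country": "US", "zip": "53703", "city": "Madisn"},
    {"country": "US", "zip": "60601", "city": "Chicago"},
    {"country": "UK", "zip": "53703", "city": "Leeds"},
]

fd_result = fd_violations(rows, ["zip"], ["city"])
cfd_result = fd_violations(
    rows, ["zip"], ["city"], condition=lambda r: r["country"] == "US"
)
```

Checked over the whole table, zip -> city groups the UK row with the two US rows; restricted to US rows, only the misspelled city is in conflict.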

Matching Dependencies (MDs)

Denial Constraints (DCs)

Constraints and Detection
Hypergraph-based approach: each cell in the DB is a vertex, and each set of tuples violating a constraint forms a hyperedge
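
A minimal sketch of the hypergraph idea, assuming the constraint is a functional dependency lhs -> rhs and that a hyperedge contains the cells of the two violating tuples on the attributes the constraint mentions. Cells are represented as (row index, attribute) pairs. The function names and the sample rows are hypothetical.

```python
from collections import Counter
from itertools import combinations


def violation_hyperedges(rows, lhs, rhs):
    """Build one hyperedge per pair of rows that violates lhs -> rhs.

    Each hyperedge is a frozenset of cells, where a cell is a
    (row_index, attribute) pair, i.e. a vertex of the hypergraph.
    """
    attrs = list(lhs) + list(rhs)
    edges = []
    for (i, r1), (j, r2) in combinations(enumerate(rows), 2):
        same_lhs = all(r1[a] == r2[a] for a in lhs)
        diff_rhs = any(r1[a] != r2[a] for a in rhs)
        if same_lhs and diff_rhs:
            edges.append(frozenset((k, a) for k in (i, j) for a in attrs))
    return edges


def cell_degrees(edges):
    """Count how many hyperedges (violations) each cell takes part in."""
    return Counter(cell for edge in edges for cell in edge)


# Hypothetical sample data: row 2 misspells the city.
rows = [
    {"zip": "53703", "city": "Madison"},
    {"zip": "53703", "city": "Madison"},
    {"zip": "53703", "city": "Madisn"},
    {"zip": "60601", "city": "Chicago"},
]

edges = violation_hyperedges(rows, ["zip"], ["city"])
degrees = cell_degrees(edges)
```

Cells that appear in many hyperedges are natural suspects: here the cells of row 2 take part in both violations, while the cells of rows 0 and 1 take part in one each.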

Error detection engine

3. Combining Error Detectors

Lots of Detectors

Combining Tools
Naïve: at least k tools agree that a value is an error
  Introduces a precision-recall tradeoff
Ordered: apply tools as a chain
  1. Run all tools on samples
  2. Pick the tool with the highest precision
  3. Apply it and verify the results
  4. Update the precision and recall of the other tools
  5. Repeat
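
The naïve strategy and its precision-recall tradeoff can be illustrated with a small voting sketch. It assumes each tool reports a set of flagged cells and that a ground-truth set of erroneous cells is available for evaluation. The tool names, cells, and helper functions are hypothetical.

```python
from collections import Counter


def min_k_errors(detections, k):
    """Return the cells flagged by at least k tools.

    detections: dict mapping a tool name to the set of cells it flags.
    """
    votes = Counter(cell for cells in detections.values() for cell in cells)
    return {cell for cell, n in votes.items() if n >= k}


def precision_recall(flagged, true_errors):
    """Precision and recall of a set of flagged cells against ground truth."""
    hits = len(flagged & true_errors)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_errors) if true_errors else 0.0
    return precision, recall


# Hypothetical output of three detectors, as (row id, attribute) cells.
detections = {
    "rules": {("r1", "city"), ("r2", "zip")},
    "patterns": {("r1", "city"), ("r3", "date")},
    "outliers": {("r1", "city"), ("r2", "zip"), ("r4", "salary")},
}
# Hypothetical ground truth, used only to evaluate the vote.
truth = {("r1", "city"), ("r2", "zip"), ("r3", "date")}

results = {
    k: precision_recall(min_k_errors(detections, k), truth)
    for k in (1, 2, 3)
}
```

With this data, k = 1 keeps every true error but also the false positive on salary, while raising k removes the false positive and progressively drops true errors that only one or two tools found.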

What’s next
We need real ensembles for error detectors
Discovery of integrity constraints is challenging
Mining is not robust to noise
Data exploration and metadata discovery are needed