Erroneous Distribution Data Identification Using Outlier Detection Techniques. W. Zhuang, Y. Zhang, J.F. Grassle. Rutgers, the State University of New Jersey, USA

Presentation transcript:

Erroneous Distribution Data Identification Using Outlier Detection Techniques. W. Zhuang, Y. Zhang, J.F. Grassle. Rutgers, the State University of New Jersey, USA

Overview: review of OBIS DQ issues; review of existing DQ methods; case study on detecting outliers in multidimensional data; discussion and future directions.

Data Quality (DQ): DQ problems can arise at every step of the data life cycle.

DQ problems (I). Data gathering: instrument failures; false identifications; geo-referencing errors. Data storage: key metadata missing; erroneous data entry; database default values masquerading as real values.
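The "default values masquerading as real values" problem can be illustrated with a minimal sketch. The sentinel list and the function name `suspicious_default` are assumptions made for illustration; the slides do not say which defaults occur in OBIS data.

```python
from typing import Optional

# Hypothetical placeholder values; a real list would come from each
# data provider's metadata, not from this sketch.
SENTINELS = {-9999.0, -999.0, 9999.0}


def suspicious_default(lat: Optional[float], lon: Optional[float]) -> bool:
    """Return True when a coordinate pair looks like a stored default."""
    if lat is None or lon is None:
        return True
    if lat in SENTINELS or lon in SENTINELS:
        return True
    # (0, 0) is a valid position at sea, but it is also a value that
    # databases often store when the real position is unknown.
    return lat == 0.0 and lon == 0.0


# Made-up coordinate pairs (latitude, longitude).
records = [(40.5, -74.4), (0.0, 0.0), (-9999.0, 12.0), (None, 3.0)]
flags = [suspicious_default(lat, lon) for lat, lon in records]
```

A flagged record is only a candidate for review, since a position such as (0, 0) can be genuine.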

DQ problems (II). Data delivery: data corruption due to encoding conversion. Data integration: duplicated records. Data retrieval: missing values. Data analysis/cleaning: inappropriate models used, etc.
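Duplicated records introduced during data integration can be found by grouping on a normalised key. The sketch below assumes hypothetical field names (`name`, `lat`, `lon`, `date`) and a rounding precision chosen only for illustration; the records are made up.

```python
from collections import defaultdict


def record_key(rec, decimals=4):
    """Build a comparison key that ignores case, spacing and tiny coordinate noise."""
    name = " ".join(rec["name"].lower().split())
    return (name, round(rec["lat"], decimals), round(rec["lon"], decimals), rec["date"])


def find_duplicates(records):
    """Return lists of record indices that share the same normalised key."""
    groups = defaultdict(list)
    for index, rec in enumerate(records):
        groups[record_key(rec)].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


# Made-up example records.
records = [
    {"name": "Gadus morhua", "lat": 42.1234, "lon": -66.5, "date": "1998-07-01"},
    {"name": "gadus  morhua ", "lat": 42.12341, "lon": -66.5, "date": "1998-07-01"},
    {"name": "Gadus morhua", "lat": 43.0, "lon": -66.5, "date": "1998-07-01"},
]
duplicates = find_duplicates(records)
```

Rounding can still separate two near-identical values that fall on either side of a rounding boundary, so this is a first pass rather than a complete test.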

DQ solving: a process-based approach. DQ solving is an essential component of data analysis and thus part of the data life cycle. A. It builds the foundation for analysis and modeling. B. It provides feedback to improve the whole data life cycle. C. It could lead to more DQ problems if not carefully executed.

DQ solving methods: harvest metadata close to the data; built-in integrity checks and double data entry; model-based approaches, either (a) statistical or (b) heuristic.
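Two of these methods are simple enough to sketch. The functions below are illustrative assumptions, not the checks used in the study: `compare_entries` compares two independent keyings of the same record (double data entry), and `coordinate_errors` is a basic integrity check on coordinate ranges.

```python
def compare_entries(first, second):
    """Return the fields on which two independent keyings of one record disagree."""
    fields = sorted(set(first) | set(second))
    return {
        field: (first.get(field), second.get(field))
        for field in fields
        if first.get(field) != second.get(field)
    }


def coordinate_errors(rec):
    """Integrity check: latitude and longitude must lie within valid ranges."""
    errors = []
    if not -90.0 <= rec["lat"] <= 90.0:
        errors.append("lat out of range")
    if not -180.0 <= rec["lon"] <= 180.0:
        errors.append("lon out of range")
    return errors


# Made-up record keyed twice; the second keying transposes two digits.
entry_a = {"lat": 40.5, "lon": -74.4, "depth_m": 120}
entry_b = {"lat": 40.5, "lon": -47.4, "depth_m": 120}
mismatches = compare_entries(entry_a, entry_b)
range_errors = coordinate_errors({"lat": 95.0, "lon": 10.0})
```

Note that the transposed longitude passes the range check, which is why the two methods complement each other.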

OBIS DQ study: metadata-related problems; DQ on scientific names; integrity checking; redundant record detection; outlier detection (a case study). Outliers sometimes represent erroneous data. We are examining data mining tools for detecting erroneous data points.

DBSCAN: a clustering tool. DBSCAN is density-based in feature space. It handles high-dimensional data. There is no need to specify the number of clusters. It identifies outliers during the clustering process. It is a fast algorithm and freely available. Reference: M. Ester, H.P. Kriegel, J. Sander and X. Xu. A density-based algorithm for discovering clusters in large spatial databases.

A diagram of DBSCAN (figure): core, border, and outlier points, with Eps = 1 unit and MinPts = 5.
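The core, border and outlier roles in the diagram can be reproduced with a small pure-Python sketch of DBSCAN. This is a simplified illustration that uses a brute-force neighbour search, not the implementation used in the study; the function names and the sample points are assumptions, while Eps = 1 and MinPts = 5 are taken from the diagram.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps=1.0, min_pts=5):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may become a border point later
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is also a core point
                queue.extend(j_neighbours)
    return labels


# Five tightly packed points, one border point, one isolated point.
points = [(0, 0), (0.5, 0), (0, 0.5), (-0.5, 0), (0, -0.5), (1.4, 0), (5, 5)]
labels = dbscan(points, eps=1.0, min_pts=5)
```

Here the point (1.4, 0) has too few neighbours to be a core point but lies within Eps of the core point (0.5, 0), so it joins the cluster as a border point, while (5, 5) is labelled as an outlier.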

Total points distribution

Result from DBSCAN

Limitation of the method: geographical outliers may be used to identify erroneous points in survey data, but may not be suitable for museum collections or literature-based data records. Are there other methods to identify erroneous distribution data? How about using environmental data as proxies?
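As one hypothetical way to use an environmental variable as a proxy, the sketch below flags records whose depth is far from the typical depth recorded for a species, using a median-based (MAD) score. The slides do not specify a method; the choice of depth, the 3.5 threshold and the function name `mad_outliers` are assumptions, and the data are made up.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score (median/MAD based) exceeds threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against; nothing can be flagged
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Made-up depths in metres for one shallow-water species; the last is suspect.
depths = [12, 15, 14, 13, 16, 15, 14, 2500]
suspects = mad_outliers(depths)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value, such as the 2500 m record, would otherwise inflate the spread and hide itself.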

Can we get some more information?

Limitations of using environmental variables: risk of imposing a rigid model at the time of pre-processing; risk of losing valuable outliers; risk of circular logic in later analyses.

Discussion: Why don’t you use more environmental variables? Can you use DBSCAN on environmental variables directly?

Possible improvements: define multiple methods as DQ components; assign bootstrap weights; present outlier candidates to experts; update weights based on user feedback.
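The loop described above might look like the following sketch. The method names, the equal starting weights and the multiplicative update rule are all assumptions made for illustration; the slides do not define how weights are assigned or updated.

```python
def score_records(flags_by_method, weights):
    """Weighted share of DQ components that flag each record (0.0 to 1.0)."""
    total = sum(weights.values())
    n_records = len(next(iter(flags_by_method.values())))
    return [
        sum(weights[m] for m, flags in flags_by_method.items() if flags[i]) / total
        for i in range(n_records)
    ]


def update_weights(weights, flags_by_method, record, is_error, rate=0.5):
    """Hypothetical rule: components that agreed with the expert gain weight."""
    updated = {}
    for method, weight in weights.items():
        agreed = flags_by_method[method][record] == is_error
        updated[method] = weight * (1 + rate) if agreed else weight * (1 - rate)
    return updated


# Equal starting ("bootstrap") weights for three hypothetical DQ components.
weights = {"dbscan": 1.0, "depth_range": 1.0, "duplicate": 1.0}
flags_by_method = {
    "dbscan": [True, False, True],
    "depth_range": [True, False, False],
    "duplicate": [False, False, False],
}
scores = score_records(flags_by_method, weights)
# An expert confirms that record 0 really is erroneous.
updated = update_weights(weights, flags_by_method, record=0, is_error=True)
```

Records with the highest scores would be shown to experts first, and each expert decision feeds back into the weights used for the next batch.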

Summary: Many data quality problems can arise during the whole data life cycle. Preliminary checking can eliminate a lot of simple errors. Expert knowledge should be integrated and be the decisive factor when it comes to DQ solving. Data mining techniques may act as metal detectors so that experts can focus on a narrowed-down group of candidates.