Chapter 3 Data Issues. What is a Data Set? Attributes (describe objects) Variable, field, characteristic, feature or observation Objects (have attributes)

Slides:



Advertisements
Similar presentations
UNIT-2 Data Preprocessing LectureTopic ********************************************** Lecture-13Why preprocess the data? Lecture-14Data cleaning Lecture-15Data.
Advertisements

Data Engineering.
CHAPTER Basic Definitions and Properties  P opulation Characteristics = “Parameters”  S ample Characteristics = “Statistics”  R andom Variables.
Lecture Notes for Chapter 2 Introduction to Data Mining
Population Population
EDU 660 Methods of Educational Research Descriptive Statistics John Wilson Ph.D.
Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall.
CSCI 347 / CS 4206: Data Mining Module 02: Input Topic 03: Attribute Characteristics.
Lecture Notes for Chapter 2 Introduction to Data Mining
EECS 800 Research Seminar Mining Biological Data
© Vipin Kumar CSci 8980 Fall CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer.
Data Basics. Data Matrix Many datasets can be represented as a data matrix. Rows corresponding to entities Columns represents attributes. N: size of the.
Chapter 1: Introduction to Statistics
Data Mining Lecture 2: data.
Data Mining Techniques
Data Presentation.
1 Statistics 202: Statistical Aspects of Data Mining Professor David Mease Tuesday, Thursday 9:00-10:15 AM Terman 156 Lecture 2 = Start chapter 2 Agenda:
Data Mining & Knowledge Discovery Lecture: 2 Dr. Mohammad Abu Yousuf IIT, JU.
Outline Introduction Descriptive Data Summarization Data Cleaning Missing value Noise data Data Integration Redundancy Data Transformation.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Basics: Data Remark: Discusses “basics concerning data sets (first half of Chapter.
1  Specific number numerical measurement determined by a set of data Example: Twenty-three percent of people polled believed that there are too many polls.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach,
Jeff Howbert Introduction to Machine Learning Winter Machine Learning Data.
Slide 1 Copyright © 2004 Pearson Education, Inc..
What is Data? Attributes
CIS527: Data Warehousing, Filtering, and Mining
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach,
1 Data Mining: Data Lecture Notes for Chapter 2. 2 What is Data? l Collection of data objects and their attributes l An attribute is a property or characteristic.
1 Data Mining Lecture 2: Data. 2 What is Data? l Collection of data objects and their attributes l Attribute is a property or characteristic of an object.
Chapter 1 Overview and Descriptive Statistics 1111.1 - Populations, Samples and Processes 1111.2 - Pictorial and Tabular Methods in Descriptive.
1 Chapter 2 Data. What is Data? Collection of data objects and their attributes An attribute is a property or characteristic of an object –Examples: eye.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach,
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach,
CSE4334/5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai.
1 What is Data? l An attribute is a property or characteristic of an object l Examples: eye color of a person, temperature, etc. l Attribute is also known.
1.  The practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of inferring* proportions in a.
Tallahassee, Florida, 2016 CIS4930 Introduction to Data Mining Getting to Know Your Data Peixiang Zhao.
3/13/2016Data Mining 1 Lecture 1-2 Data and Data Preparation Phayung Meesad, Ph.D. King Mongkut’s University of Technology North Bangkok (KMUTNB) Bangkok.
Data Mining: Data Prepossessing What is to be done before we get to Data Mining?
CS 405G: Introduction to Database Systems. 9/29/20162 Review What Is Data Mining? Data mining (knowledge discovery from data) Extraction of interesting.
Chapter 2 Getting to Know Your Data Yubao (Robert) Wu Georgia State University.
1 Data Mining Lecture 02a: Data Theses slides are based on the slides by Tan, Steinbach and Kumar (textbook authors)
Data Preliminaries CSC 600: Data Mining Class 1.
Elementary Statistics
Data Mining Lecture 02a: Theses slides are based on the slides by Data
Lecture Notes for Chapter 2 Introduction to Data Mining
STAT 311 Chapter 1 - Overview and Descriptive Statistics
Data Mining Waqas Haider Bangyal.
Lecture Notes for Chapter 2 Introduction to Data Mining
statistics Specific number
Lecture Notes for Chapter 2 Introduction to Data Mining
Lecture Notes for Chapter 2 Introduction to Data Mining
CISC 4631 Data Mining Lecture 02:
Lecture Notes for Chapter 2 Introduction to Data Mining
Lecture Notes for Chapter 2 Introduction to Data Mining
Lecture Notes for Chapter 2 Introduction to Data Mining
Basic Statistical Terms
Lecture Notes for Chapter 2 Introduction to Data Mining
statistics Specific number
Lecture Notes for Chapter 2 Introduction to Data Mining
Lecture Notes for Chapter 2 Introduction to Data Mining
Lecture Notes for Chapter 2 Introduction to Data Mining
Data Mining Lecture 02a: Theses slides are based on the slides by Data
Lecture Notes for Chapter 2 Introduction to Data Mining
Group 9 – Data Mining: Data
Data Preliminaries CSC 576: Data Mining.
Lecture 2: Data Representation
Lecture Notes for Chapter 2 Introduction to Data Mining
Data Pre-processing Lecture Notes for Chapter 2
Lecture Notes for Chapter 2 Introduction to Data Mining
Presentation transcript:

Chapter 3 Data Issues

What is a Data Set? Attributes (describe objects) Variable, field, characteristic, feature or observation Objects (have attributes) Record, point, case, sample, entity or item Data Set Collection of objects

Type of an Attributes Depends on the following properties: Distinctness:=,  Order:, = Addition:+, - Multiplication:*, /

Types of Attributes Attribute Type Description ExamplesOperations NominalEach value represents a label. (Typical comparisons between two values are limited to “equal” or “not equal”) Flower color, gender, sip code Mode, entropy, contingency correlation,  2 test OrdinalThe values can be ordered. (Typical comparisons between two values are “equal” or “greater” or “less” Hardness of minerals, (good, better, best), grades, street numbers, rank, age Median, percentiles, rank correlation, run tests, sign tests IntervalThe differences between values are meaningful i.e., a unit of measurement exist. (+, - ) Calendar dates, temperature in Celsius or Fahrenheit Mean, standard deviation, Pearson’s correlation, t anf F test RatioDifferences and ratios are meaningful. (*, /) Monetary quantities, count, age, mass, length, electrical current Geometric mean, harmonic mean, percent variation

Attribute level

Discrete and Continuous Attributes Discrete attributes A discrete attributes has only a finite or countably infinite set of values. Eg. Zip codes, counts or the set of words in a document Discrete attributes are often represented as integer variables Binary attributes are a special case of discrete attributes and assume only two values Eg. Yes/no, true/false, male/female Binary attributes are often represented as Boolean variables, or as integer variables that take on the values 0 or 1 Continuous attributes A continuous attribute has real number values. Eg. Temperature, height or weight (Practically real values can only be measured and represented to finite number of digits) Continuous attributes are typically represented as floating point variables

Structured Data Sets Common types Record Graph Ordered Three important characteristics Dimensionality Sparsity resolution

Record Data Most of the existing data mining work is focused around data sets that consist of a collection of records (data objects), each of which consists of fixed set of data fields (attributes) NameGenderHeightOutput KristinaF1.6 mMedium JimM2 mMedium MaggieF1.9 mTall MarthaF1.88 mTall StephanieF1.7 mMedium BobM1.85 mMedium KathyF1.6 mMedium DaveM1.7 mMedium WorthM2.2 mTall StevenM2.1 mTall DebbieF1.8 mMedium ToddM1.95 mMedium KimF1.9 mTall AmyF1.8 mMedium LynetteF1.75 mMedium

Data Matrix If all objects in data set have the same set of numeric attributes, then each object represents a point (vector) in multi-dimensional space. Each attribute of the object corresponds to a dimension Projection of X load Projection of Y load DistanceLoadThickness

Document Data Each document becomes a ‘term’ vector, where each term is a component (attribute) of the vector, and where the value of each component of the vector is the number of times the corresponding term occurs in the document

Transaction Data Transaction data is a special type of record data, where each record (transaction) involves a set of items. For example, consider a grocery store. The set of product purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are items

Graph Data Data with relationships among objects Likes OO modeling where node represents object and the link representing relationship. Eg. Linked web pages Data with object that are graphs Object contains another objects known as subobjects, e.g., the structure of chemical compounds, where the nodes are atoms and the links between nodes are chemical bonds.

Ordered Data: The attributes have relationships that involve order in time or space. Fig. 2.4 variations of ordered data Sequential data Sequence data Time series data Spatial data

Data Quality The following are some well known issues Noise and outliers Missing values Duplicate data Inconsistent values

Noise and Outliers Typographical errors lead to incorrect values Nominal attribute is misspelled Extra possible value for that attribute Different names for the same thing (eg: Pepsi / Pepsi Cola Noise – modification of original value Random error Non-random error Noise can be Temporal Spatial

Cont’d Signal processing can reduce (generally not eliminate) noise Outliers – small number of points with characteristics different from rest of the data

Missing Values: Eliminate Data Objects with Missing Values Out-of-range entries Negative number in a field that is normally only positive Zero in a field that can never normally be zero The reason: Malfunctional measurement equipment Changes in experimental design during data collection Collection of several similar but not identical datasets Respondents in a survey may refuse to answer certain questions as age or income

Cont’d A simple and effective strategy is to eliminate those records which have missing values. A related strategy is to eliminate attribute that have missing values Drawback: you may end up removing a large number of objects

Missing values: Estimating them Price of the IBM stock changes in a reasonably smooth fashion. The missing values can be estimated by interpolation For a data set that has many similar data points, a nearest neighbor approach can be used to estimate the missing values. If the attribute is continuous, then the average attribute value of the nearest neighbors can be used. While if the attribute is categorical, then the most commonly occurring attribute value can be taken

Missing values: Using the missing value as another value Many data mining approaches can be modified to operate by ignoring missing values. E.g. Clustering – similarity between pairs of data objects need to be calculated. If one or both objects of a pair have missing values for some attributes, then the similarity can be calculated by using only other attributes.

Inconsistent/ Duplicate Data Most machine learning tools will produce different results if data files are duplicated – repetition influence on the result. Errors made at data entry – spelling errors Can corrected manually Data integration Different names of an attribute in different databases Also having duplication