© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach,

What is Data?
- Data is a collection of data objects and their attributes.
- An attribute is a property or characteristic of an object.
  - Examples: eye color of a person, temperature, etc.
  - An attribute is also known as a variable, field, characteristic, or feature.
- A collection of attributes describes an object.
  - An object is also known as a record, point, case, sample, entity, or instance.

Types of Attributes
- There are different types of attributes:
  - Nominal. Examples: ID numbers, eye color, zip codes
  - Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  - Interval. Examples: calendar dates, temperatures in Celsius or Fahrenheit
  - Ratio. Examples: monetary quantities (currency)

| Attribute Type | Description | Examples | Operations |
|---|---|---|---|
| Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, correlation, χ² test |
| Ordinal | The values of an ordinal attribute provide enough information to order objects. (<, >) | hardness of minerals, {good, better, best}, grades, street numbers | median, percentiles, rank correlation |
| Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests |
| Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | monetary quantities, electrical current | geometric mean, harmonic mean, percent variation |
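The operations column can be illustrated with a small sketch using Python's standard `statistics` module. All sample values and the grade ordering below are hypothetical, chosen only to show one permitted summary per attribute type.

```python
import statistics

# Hypothetical sample values, one list per attribute type.
eye_colors = ["brown", "blue", "brown", "green"]   # nominal
grades = ["C", "A", "B", "B", "A"]                 # ordinal
temperatures_c = [18.0, 21.5, 19.5, 25.0]          # interval
incomes = [20000.0, 40000.0, 80000.0]              # ratio

# Nominal: only = and != are meaningful, so the mode is a valid summary.
most_common_color = statistics.mode(eye_colors)

# Ordinal: order is meaningful, so a median can be taken over ranks.
grade_rank = {"C": 0, "B": 1, "A": 2}              # assumed ordering, A highest
rank_to_grade = {rank: grade for grade, rank in grade_rank.items()}
median_grade = rank_to_grade[statistics.median_low(grade_rank[g] for g in grades)]

# Interval: differences are meaningful, so mean and standard deviation apply.
mean_temperature = statistics.mean(temperatures_c)
temperature_spread = statistics.stdev(temperatures_c)

# Ratio: ratios are meaningful as well, so the geometric mean applies.
typical_income = statistics.geometric_mean(incomes)
```

Each type also supports the operations of the types listed above it in the table, so a mode of the income values would be legal, whereas a mean of eye colors would not.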

Discrete and Continuous Attributes
- Discrete attribute
  - Has only a finite or countably infinite set of values.
  - Examples: zip codes, counts, or the set of words in a collection of documents.
  - Often represented as integer variables.
  - Note: binary attributes are a special case of discrete attributes.
- Continuous attribute
  - Has real numbers as attribute values.
  - Examples: temperature, height, or weight.
  - Practically, real values can only be measured and represented using a finite number of digits.
  - Continuous attributes are typically represented as floating-point variables.

Types of Data Sets
- Record
  - Data matrix
  - Document data
  - Transaction data
- Graph
  - World Wide Web
  - Molecular structures
- Ordered
  - Spatial data
  - Temporal data
  - Sequential data
  - Genetic sequence data

Data Matrix
- If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space.
- Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute.

Document Data
- Each document becomes a "term" vector:
  - each term is a component (attribute) of the vector,
  - the value of each component is the number of times the corresponding term occurs in the document.
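A minimal sketch of building term vectors, assuming whitespace tokenization and lowercasing (neither is specified on the slide). The function name `term_vectors` and the two sample documents are hypothetical.

```python
from collections import Counter

def term_vectors(documents):
    """Return the sorted vocabulary and one term-count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Counter returns 0 for terms that do not occur in this document.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

# Hypothetical documents.
vocabulary, vectors = term_vectors(["game ball game", "team ball"])
```

Stacking the vectors row by row gives a data matrix with one row per document and one column per term.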

Transaction Data
- A special type of record data, where
  - each record (transaction) involves a set of items.
  - For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.

Why Preprocess the Data?
- Measures for data quality:
  - Accuracy: correct or wrong, accurate or not
  - Completeness: not recorded, unavailable, ...
  - Consistency: some modified but some not, ...
  - Believability: how far can the data be trusted to be correct?
  - Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data
- Data reduction
  - Sampling
  - Data compression
- Data transformation and data discretization
  - Normalization

Data Cleaning
- Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
  - Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., Occupation = “ ” (missing data)
  - Noisy: containing noise, errors, or outliers
    - e.g., Salary = “−10” (an error)
  - Inconsistent: containing discrepancies in codes or names, e.g.,
    - Age = “42”, Birthday = “03/07/2010”
    - Was rating “1, 2, 3”, now rating “A, B, C”
    - Discrepancy between duplicate records
  - Intentional (e.g., disguised missing data)
    - Jan. 1 as everyone’s birthday?

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious and often infeasible
- Fill it in automatically with
  - a global constant, e.g., “unknown” (which may act as a new class)
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
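The two mean-based strategies can be sketched as follows, assuming missing values are encoded as `None`. The function names, income values, and class labels are hypothetical.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed_mean = mean(v for v in values if v is not None)
    return [observed_mean if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace each missing value with the mean of its own class.

    Raises KeyError if a class has no observed values at all.
    """
    observed_by_class = {}
    for value, label in zip(values, labels):
        if value is not None:
            observed_by_class.setdefault(label, []).append(value)
    class_means = {label: mean(vs) for label, vs in observed_by_class.items()}
    return [class_means[label] if value is None else value
            for value, label in zip(values, labels)]

# Hypothetical incomes (in thousands) with two missing entries.
incomes = [30, None, 50, 90, None, 110]
classes = ["low", "low", "low", "high", "high", "high"]

filled_global = fill_with_mean(incomes)
filled_by_class = fill_with_class_mean(incomes, classes)
```

In this sample the global mean (70) sits between the two classes, while the class-wise means (40 and 100) keep each filled value close to its own group, which is why the slide calls the class-wise variant smarter.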

Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations

How to Handle Noisy Data?
- Binning
  - First sort the data and partition it into (equal-frequency) bins
  - Then smooth by bin means, smooth by bin boundaries, etc.

Binning Methods for Data Smoothing
- Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
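The price example can be reproduced with a short sketch. It assumes the number of values divides evenly into the bins, rounds bin means to the nearest integer (the slide's bin means 22.75 and 29.25 appear as 23 and 29), and sends ties to the lower boundary. The function names are hypothetical.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size."""
    size = len(sorted_values) // n_bins  # assumes an even split
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value with the rounded mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        # Ties go to the lower boundary (an assumption, not from the slide).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```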

Data Reduction: Sampling
- Sampling: obtaining a small sample s to represent the whole data set N
- Key principle: choose a representative subset of the data
  - Simple random sampling may perform very poorly in the presence of skew
  - Develop adaptive sampling methods, e.g., stratified sampling

Types of Sampling
- Simple random sampling
  - There is an equal probability of selecting any particular item
- Sampling without replacement
  - Once an object is selected, it is removed from the population
- Sampling with replacement
  - A selected object is not removed from the population
- Stratified sampling
  - Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
  - Used in conjunction with skewed data
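These sampling schemes can be sketched with Python's standard `random` module. The function names, the 90/10 skewed data set, and the choice to draw at least one record per stratum are assumptions made for illustration.

```python
import random

def sample_without_replacement(population, k, rng):
    """Each selected object is removed, so no object appears twice."""
    return rng.sample(population, k)

def sample_with_replacement(population, k, rng):
    """Selected objects stay in the population and may be drawn again."""
    return [rng.choice(population) for _ in range(k)]

def stratified_sample(records, stratum_of, fraction, rng):
    """Draw roughly the same fraction from every stratum (partition)."""
    strata = {}
    for record in records:
        strata.setdefault(stratum_of(record), []).append(record)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical skewed data: 90 "common" records and 10 "rare" ones.
records = [("common", i) for i in range(90)] + [("rare", i) for i in range(10)]

no_repeat = sample_without_replacement(records, 10, rng)
may_repeat = sample_with_replacement(records, 10, rng)
stratified = stratified_sample(records, lambda r: r[0], 0.1, rng)
```

With a 10% fraction, the stratified sample always contains exactly one "rare" record, whereas a simple random sample of ten may contain none.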

Data Reduction: Data Compression
- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless, but only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
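Lossless string compression can be illustrated with Python's standard `zlib` module. DEFLATE is only one example of such an algorithm; the slide does not name one, and the repetitive sample text is hypothetical.

```python
import zlib

# Hypothetical, highly repetitive text, which compresses well.
original = ("data mining " * 200).encode("utf-8")

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Lossless: decompression recovers the original bytes exactly.
is_lossless = restored == original
compression_ratio = len(compressed) / len(original)
```

The compressed bytes cannot be searched or edited as text without decompressing them first, which reflects the slide's point that only limited manipulation is possible without expansion.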

Data Transformation
- A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values
- Methods
  - Smoothing: remove noise from the data
  - Normalization: scale values to fall within a smaller, specified range
    - min-max normalization
    - z-score normalization
    - normalization by decimal scaling

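Normalization
- Min-max normalization: to [new_min_A, new_max_A]
  \[ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
  - Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,000 is mapped to
    \[ \frac{73{,}000 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0.0) + 0.0 \approx 0.709 \]
- Z-score normalization (μ: mean, σ: standard deviation):
  \[ v' = \frac{v - \mu_A}{\sigma_A} \]
  - Ex. Let μ = 54,000, σ = 16,000. Then a value v is mapped to (v - 54,000) / 16,000; for instance, the $73,000 income from the min-max example is mapped to 19,000 / 16,000 = 1.1875
- Normalization by decimal scaling:
  \[ v' = \frac{v}{10^{j}} \]
  where j is the smallest integer such that \(\max(|v'|) < 1\)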
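A minimal sketch of the three normalization methods follows. The function names are hypothetical, the decimal-scaling helper only searches non-negative exponents j, and its sample values (-986 and 917) are made up for illustration.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly rescale v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j for the smallest j >= 0 with max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Income example: range $12,000 to $98,000, mean 54,000, std. dev. 16,000.
income_min_max = min_max(73_000, 12_000, 98_000)
income_z = z_score(73_000, 54_000, 16_000)

# Hypothetical values for decimal scaling.
scaled, j = decimal_scaling([-986, 917])
```

Min-max normalization needs the attribute's observed range, so a later value outside that range falls outside the target interval; z-score normalization avoids this by using the mean and standard deviation instead.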

Similarity and Dissimilarity
- Similarity
  - Numerical measure of how alike two data objects are
  - Is higher when objects are more alike
  - Often falls in the range [0, 1]
- Dissimilarity
  - Numerical measure of how different two data objects are
  - Is lower when objects are more alike
  - Minimum dissimilarity is often 0
  - Upper limit varies
- Proximity refers to either a similarity or a dissimilarity

Euclidean Distance
- Euclidean distance:
  \[ d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2} \]
  where n is the number of dimensions (attributes) and \(p_k\) and \(q_k\) are, respectively, the k-th attributes (components) of data objects p and q.
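A direct sketch of the Euclidean distance for two objects given as equal-length sequences of numbers. The function name and the sample points are hypothetical.

```python
import math

def euclidean(p, q):
    """Square root of the summed squared differences over all attributes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of attributes")
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

# Hypothetical two-dimensional points.
d_origin = euclidean((0, 0), (3, 4))
d_diagonal = euclidean((0, 2), (2, 0))
```

Because attributes measured on larger scales dominate the sum, attributes are usually normalized before this distance is computed.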

Minkowski Distance
- Minkowski distance is a generalization of Euclidean distance:
  \[ d(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^{r} \right)^{1/r} \]
  where r is a parameter, n is the number of dimensions (attributes), and \(p_k\) and \(q_k\) are, respectively, the k-th attributes (components) of data objects p and q.

Minkowski Distance: Examples
- r = 1: city block (Manhattan, taxicab, L1 norm) distance
  - A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors
- r = 2: Euclidean distance
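The two special cases can be checked with a small sketch of the Minkowski distance, in which the absolute attribute differences are raised to the power r, summed, and the sum is raised to the power 1/r. The function name and the sample vectors are hypothetical.

```python
def minkowski(p, q, r):
    """Minkowski distance with parameter r between equal-length vectors."""
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

# Hypothetical points.
p, q = (0, 2), (3, 1)
city_block = minkowski(p, q, 1)   # |0 - 3| + |2 - 1|
euclid = minkowski(p, q, 2)       # sqrt(3**2 + 1**2)

# For binary vectors, r = 1 counts the differing bits (Hamming distance).
hamming = minkowski((1, 0, 1, 1), (0, 0, 1, 0), 1)
```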

Common Properties of a Distance
- Distances, such as the Euclidean distance, have some well-known properties:
  1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
  2. d(p, q) = d(q, p) for all p and q. (Symmetry)
  3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)
  where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
- A distance that satisfies these properties is a metric.
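The three properties can be spot-checked numerically on a finite set of points. This is a sketch, not a proof: passing the check on sample points does not establish that a function is a metric, but a failure does show that it is not. The helper name `is_metric_on`, the tolerance, and the sample points are hypothetical; `math.dist` is the standard library's Euclidean distance.

```python
import itertools
import math

def is_metric_on(points, d, tol=1e-9):
    """Check the three properties for every combination of the given points."""
    for p, q in itertools.product(points, repeat=2):
        # 1. Positive definiteness: d >= 0, and d == 0 exactly when p == q.
        if d(p, q) < 0 or (d(p, q) == 0) != (p == q):
            return False
        # 2. Symmetry.
        if abs(d(p, q) - d(q, p)) > tol:
            return False
    for p, q, r in itertools.product(points, repeat=3):
        # 3. Triangle inequality.
        if d(p, r) > d(p, q) + d(q, r) + tol:
            return False
    return True

def squared_euclidean(p, q):
    """A dissimilarity that is not a metric."""
    return math.dist(p, q) ** 2

points = [(0, 2), (2, 0), (3, 1), (5, 1)]      # hypothetical sample points
euclidean_ok = is_metric_on(points, math.dist)

line_points = [(0,), (1,), (2,)]
squared_ok = is_metric_on(line_points, squared_euclidean)
```

Squared Euclidean distance fails the triangle inequality on the three collinear points, since 4 > 1 + 1.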

Cosine Similarity
- If d1 and d2 are two document vectors, then
  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
  where · indicates the vector dot product and ||d|| is the length of vector d.
- Example:
  d1 = 3 2 0 5 0 0 0 2 0 0
  d2 = 1 0 0 0 0 0 0 1 0 2
  d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
  ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
  cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
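The worked example can be verified with a short sketch. The function name is hypothetical, and the vectors are the d1 and d2 from the slide.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    length_1 = math.sqrt(sum(a * a for a in d1))
    length_2 = math.sqrt(sum(b * b for b in d2))
    return dot / (length_1 * length_2)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

similarity = cosine_similarity(d1, d2)   # 5 / (sqrt(42) * sqrt(6))
```

The function divides by zero if either vector contains only zeros, so an empty document would need separate handling.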
