© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ What is Data? l Collection of data objects and their attributes l An attribute is a property or characteristic of an object –Examples: eye color of a person, temperature, etc. –Attribute is also known as variable, field, characteristic, or feature l A collection of attributes describe an object –Object is also known as record, point, case, sample, entity, or instance Attributes Objects
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Types of Attributes l There are different types of attributes –Nominal Examples: ID numbers, eye color, zip codes –Ordinal Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} –Interval Examples: calendar dates, temperatures in Celsius or Fahrenheit. –Ratio Examples: monetary, currency
Attribute Type DescriptionExamplesOperations NominalThe values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ) zip codes, employee ID numbers, eye color, sex: {male, female} mode, entropy, correlation, 2 test OrdinalThe values of an ordinal attribute provide enough information to order objects. ( ) hardness of minerals, {good, better, best}, grades, street numbers median, percentiles, rank correlation IntervalFor interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, - ) calendar dates, temperature in Celsius or Fahrenheit mean, standard deviation, Pearson's correlation, t and F tests RatioFor ratio variables, both differences and ratios are meaningful. (*, /) monetary quantities, electrical current geometric mean, harmonic mean, percent variation
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Discrete and Continuous Attributes l Discrete Attribute –Has only a finite or countably infinite set of values –Examples: zip codes, counts, or the set of words in a collection of documents –Often represented as integer variables. –Note: binary attributes are a special case of discrete attributes l Continuous Attribute –Has real numbers as attribute values –Examples: temperature, height, or weight. –Practically, real values can only be measured and represented using a finite number of digits. –Continuous attributes are typically represented as floating-point variables.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Types of data sets l Record –Data Matrix –Document Data –Transaction Data l Graph –World Wide Web –Molecular Structures l Ordered –Spatial Data –Temporal Data –Sequential Data –Genetic Sequence Data
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Record Data l Data that consists of a collection of records, each of which consists of a fixed set of attributes
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Matrix l If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space. l Such data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Document Data l Each document becomes a `term' vector, –each term is a component (attribute) of the vector, –the value of each component is the number of times the corresponding term occurs in the document.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Transaction Data l A special type of record data, where –each record (transaction) involves a set of items. –For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Graph Data l Examples: Generic graph and HTML Links
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Graph Data
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Chemical Data l Benzene Molecule: C 6 H 6
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Ordered Data l Sequences of transactions An element of the sequence Items/Events
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Ordered Data l Genomic sequence data
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Ordered Data l Genomic sequence data
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Ordered Data l Spatio-Temporal Data Average Monthly Temperature of land and ocean
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Preprocessing
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Why Preprocess the Data? l Measures for data quality: –Accuracy: correct or wrong, accurate or not –Completeness: not recorded, unavailable, … –Consistency: some modified but some not … –Believability: how trustable the data are correct? –Interpretability: how easily the data can be understood?
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Major Tasks in Data Preprocessing l Data cleaning –Fill in missing values, smooth noisy data l Data Reduction –Sampling –Data Compression l Data transformation and data discretization –Normalization
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Cleaning l Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty, human or computer error, transmission error –incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data e.g., Occupation = “ ” (missing data) –noisy: containing noise, errors, or outliers e.g., Salary = “−10” (an error) –inconsistent: containing discrepancies in codes or names, e.g., Age = “42”, Birthday = “03/07/2010” Was rating “1, 2, 3”, now rating “A, B, C” discrepancy between duplicate records –Intentional (e.g., disguised missing data) Jan. 1 as everyone’s birthday?
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ How to Handle Missing Data? l Ignore the tuple: usually done when class label is missing (when doing classification)—not effective when the % of missing values per attribute varies considerably l Fill in the missing value manually: tedious + infeasible? l Fill in it automatically with –a global constant : e.g., “unknown”, a new class?! –the attribute mean –the attribute mean for all samples belonging to the same class: smarter
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Noisy Data l Noise: random error or variance in a measured variable l Incorrect attribute values may be due to – faulty data collection instruments – data entry problems – data transmission problems – technology limitation
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ How to Handle Noisy Data? l Binning –first sort data and partition into (equal-frequency) bins –then one can smooth by bin means, smooth by bin boundaries, etc.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Binning Methods for Data Smoothing Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34 * Partition into equal-frequency (equi-depth) bins: - Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34 * Smoothing by bin means: - Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29 * Smoothing by bin boundaries: - Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Reduction: Sampling l Sampling: obtaining a small sample s to represent the whole data set N l Key principle: Choose a representative subset of the data –Simple random sampling may have very poor performance in the presence of skew –Develop adaptive sampling methods, e.g., stratified sampling:
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Types of Sampling l Simple random sampling –There is an equal probability of selecting any particular item l Sampling without replacement –Once an object is selected, it is removed from the population l Sampling with replacement –A selected object is not removed from the population l Stratified sampling: –Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data) –Used in conjunction with skewed data
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Sampling: With or without Replacement SRSWOR (simple random sample without replacement) SRSWR Raw Data
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Sampling: Cluster or Stratified Sampling Raw Data Cluster/Stratified Sample
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Sample Size 8000 points 2000 Points500 Points
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Reduction : Data Compression l String compression – There are extensive theories and well-tuned algorithms – Typically lossless, but only limited manipulation is possible without expansion l Audio/video compression – Typically lossy compression, with progressive refinement – Sometimes small fragments of signal can be reconstructed without reconstructing the whole
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Compression Original Data Compressed Data lossless Original Data Approximated lossy
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Transformation l A function that maps the entire set of values of a given attribute to a new set of replacement values s.t. each old value can be identified with one of the new values l Methods – Smoothing: Remove noise from data – Normalization: Scaled to fall within a smaller, specified range min-max normalization z-score normalization normalization by decimal scaling
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Normalization l Min-max normalization: to [new_min A, new_max A ] – Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then $73,000 is mapped to l Z-score normalization (μ: mean, σ: standard deviation): – Ex. Let μ = 54,000, σ = 16,000. Then l Normalization by decimal scaling Where j is the smallest integer such that Max(|ν’|) < 1
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Similarity and Dissimilarity
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Similarity and Dissimilarity l Similarity –Numerical measure of how alike two data objects are. –Is higher when objects are more alike. –Often falls in the range [0,1] l Dissimilarity –Numerical measure of how different are two data objects –Lower when objects are more alike –Minimum dissimilarity is often 0 –Upper limit varies l Proximity refers to a similarity or dissimilarity
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Euclidean Distance l Euclidean Distance Where n is the number of dimensions (attributes) and p k and q k are, respectively, the k th attributes (components) or data objects p and q.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Euclidean Distance Distance Matrix
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Minkowski Distance l Minkowski Distance is a generalization of Euclidean Distance Where r is a parameter, n is the number of dimensions (attributes) and p k and q k are, respectively, the kth attributes (components) or data objects p and q.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Minkowski Distance: Examples l r = 1. City block (Manhattan, taxicab, L 1 norm) distance. –A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors l r = 2. Euclidean distance
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Minkowski Distance Euclidean Distance Matrix Manhattan Distance Matrix
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Common Properties of a Distance l Distances, such as the Euclidean distance, have some well known properties. 1.d(p, q) 0 for all p and q and d(p, q) = 0 only if p = q. (Positive definiteness) 2.d(p, q) = d(q, p) for all p and q. (Symmetry) 3.d(p, r) d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality) where d(p, q) is the distance (dissimilarity) between points (data objects), p and q. l A distance that satisfies these properties is a metric
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Similarity Between Binary Vectors l Common situation is that objects, i and j, have only binary attributes
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Example
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Cosine Similarity l If d 1 and d 2 are two document vectors, then cos( d 1, d 2 ) = (d 1 d 2 ) / ||d 1 || ||d 2 ||, where indicates vector dot product and || d || is the length of vector d. l Example: d 1 = d 2 = d 1 d 2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5 ||d 1 || = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0) 0.5 = (42) 0.5 = ||d 2 || = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2) 0.5 = (6) 0.5 = cos( d 1, d 2 ) =.3150
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Correlation l Correlation measures the linear relationship between objects
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Visually Evaluating Correlation Scatter plots showing the similarity from –1 to 1.